
Introduction to Pandas – Hands-On Tutorial – Part One

This is the first of a two-part hands-on tutorial on Pandas for beginners in Data Science and/or Python programming, with ample screenshots.

Introduction to Pandas – A Python Library for Data Manipulation and Analysis

Hands-On Tutorial – Part 1

This tutorial is for beginners in Data Science and/or Python programming. It focuses on Exploratory Data Analysis, Data Manipulation and Data Filtering using Pandas library in Python.

Photo by Kunal Kalra on Unsplash

If you are a Python developer who works regularly with data, I am sure you have heard of, and probably even used, the Pandas library. This is a hands-on tutorial for beginners in the field of Data Science and/or Python programming. I strongly believe in "learning by doing", so I feel a hands-on tutorial is the best way to learn something. I will not bore you with much theory, and I will keep this tutorial to the point. My goal is to pave the way for you to explore the inexhaustible world of Data Science.

So, let me start with a brief introduction to Pandas. Pandas is a library that allows you to work with tabular data, time-series data, matrix data and so on. A few great examples of how Pandas makes our lives easier are listed below, with a quick code sketch right after the list:

  • Importing data from Comma-Separated Values (CSV) files, JSON files etc.
  • Filling in missing values in the dataset
  • Data wrangling and manipulation
  • Merging multiple datasets into one dataset
  • Exporting our results as Comma-Separated Values (CSV) files, JSON files etc.
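
Just to give you a taste of what this looks like in code, here is a rough sketch (the file names and the artist column here are made up purely for illustration, and we will install Pandas in Step 1 before running anything):

import pandas as pd

songs = pd.read_csv("songs.csv")                               # import data from a CSV file
songs = songs.fillna(0)                                        # fill any missing values with a default
merged = songs.merge(pd.read_csv("artists.csv"), on="artist")  # merge two datasets on a common column
merged.to_csv("merged_songs.csv")                              # export the result as a CSV file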

The possibilities are endless when it comes to what this library has to offer. So, without further ado, let’s begin the first tutorial. I have broken this tutorial down into easy-to-follow steps, and at the end I have listed some activities that you can do at your convenience in order to sharpen your skills.

Step 1 – Install Pandas

First and foremost, let’s start with how to install the Pandas library. I am using a Windows 10 PC for this tutorial, but you are free to use any operating system of your choice (macOS, Linux and so on). There are many ways to install Pandas; I will explain how I did it personally, but if you would like to do it some other way, here is the complete set of instructions to install Pandas.

The easiest way to install Pandas is to get it through the Anaconda distribution, which is the most popular Python distribution for scientific computing. The Pandas library comes as a part of the distribution. I currently have Python version 3.8.5, but this should work fine for Python versions 3.5.3 and higher.

Let’s confirm my statement that the Pandas library comes as a part of the Anaconda distribution. If your Anaconda installation was successful, there should be an application called Anaconda Navigator installed on your PC. On the splash page of Anaconda Navigator, there is an application called "CMD.exe Prompt". Hit "Launch".

Anaconda Navigator Splash Page

Once you hit "Launch", a terminal will open up. This terminal can be used to install/uninstall libraries such as Pandas.

CMD.exe Prompt

Once the terminal shows up, type one of the following commands and hit "Enter".

conda install pandas
pip install pandas
Pandas has already been installed!

You should see that Pandas is already installed!
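
If you want to double-check from Python itself, you can print the installed version in the same terminal (the exact version number you see will depend on your installation):

python -c "import pandas; print(pandas.__version__)"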

Anaconda comes with integrated development environments (IDEs) such as Jupyter Notebook already installed. I will be using Jupyter Notebook for my tutorials, but you are free to choose any IDE that you are comfortable with.

Step 2 – Set up Jupyter Notebook

Now we are ready to dive deep into the world of Pandas. But first, I want to show you how to set up the Jupyter Notebook IDE. If you plan to use Jupyter Notebook for this tutorial, follow along; if not, you can skip to Step 3.

Open the "CMD.exe prompt", as shown in Step 1. Then type in the following command and hit "Enter".

jupyter notebook
Using CMD.exe prompt to open Jupyter Notebook

Alternatively, you can find the Jupyter Notebook application in the Anaconda Navigator launcher and start it from there.

Click "Launch" on Jupyter Notebook

A page will open up on your browser as follows:

Splash page for Jupyter

Hit "New" and select "Python 3" under "Notebook".

And Voila! You should have your own Jupyter Notebook up and running!

First things first! Let’s rename the Notebook to something like "Pandas_Tutorial". You are free to name it anything you want.

Now we can start coding! Hooray!

Step 3 – Importing data from CSV files in Pandas

If you made it until here, good things are soon to come my friend! Hold tight and get ready to launch!

Let’s start with a very important step: importing Pandas into our Jupyter Notebook.

We can import pandas by typing

import pandas as pd

Using an alias for Pandas such as pd saves you some typing in the later steps, although it is completely optional.

Importing pandas

Now let us download the CSV file that we will be using for this tutorial. I will be using the dataset for Spotify Music Classification. You can download it by signing up/signing in on Kaggle. The reason I chose this dataset is that it has many "features" or "attributes" describing each song, which we can use to gain some interesting insights. This process is called Exploratory Data Analysis, in which we try to find meaningful information about the dataset, such as:

  • Finding Top K "Most Energetic/Least Energetic" songs
  • Finding Top K songs which have the lowest/highest duration
  • Finding Top K "Danceable/Least Danceable" songs

and so on.

We can import this dataset using the following code:

df = pd.read_csv(r"<File Path>data.csv")

The r before the file path is not a typo; it denotes a raw string. Without it, the backslashes in a Windows file path can be interpreted as escape sequences, which may lead to errors such as a "FileNotFoundError".
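
For example, both of the following are equivalent on a Windows machine (the path below is just a hypothetical placeholder for wherever you saved the file):

df = pd.read_csv(r"C:\Users\<username>\Downloads\data.csv")    # raw string: backslashes are taken literally
df = pd.read_csv("C:\\Users\\<username>\\Downloads\\data.csv") # without r, each backslash must be escaped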

Pandas uses a data structure called a DataFrame to hold this data. A DataFrame lets us store "relational" or "tabular" data, and gives us many built-in functions for manipulating it.
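
To get a feel for what a DataFrame is, here is a tiny one built by hand from a Python dictionary (the column names and values are made up for illustration): each key becomes a column, and each list holds that column’s values.

toy = pd.DataFrame({
    "song_title": ["Song A", "Song B"],  # one column of strings
    "energy": [0.81, 0.42],              # one column of numbers
})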

We can just write the name of the DataFrame to view it as follows:

df
Importing a CSV into a DataFrame and viewing it

But often, we don’t want to view the entire dataset; we just want a sneak peek of it. Here’s how you can do that:

df.head() #By default it shows the first 5 rows

We can also specify the number of rows in the head() function as follows:

df.head(10) #It will show the first 10 rows
First 5 rows of the DataFrame

And what if we want to view the last few rows?

df.tail() #By default it shows the last 5 rows

We can also specify the number of rows in the tail() function as follows:

df.tail(10) #It will show the last 10 rows
Last 5 rows of the DataFrame

What if we don’t want to view the data itself, but want a quick overview of the entire DataFrame instead, such as the datatypes of the columns, the number of null/non-null values in each column and so on? We can use the df.info() function as follows:

df.info()
Information about the DataFrame

From the above screenshot, we can observe that there are 16 columns and 2017 data points (rows) in the DataFrame. We can also see that there are no null values, which shows that the data is well maintained, and we can read off the datatypes of all the columns.
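
If you just want the row and column counts, df.shape returns them as a (rows, columns) tuple:

df.shape # (2017, 16) for this dataset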

If we are only interested in the datatypes of the columns, we can check them using df.dtypes

Datatypes for all columns of the DataFrame

In this manner, we can view the data and extract meaningful information about it. If this dataset had null values, we would have to deal with those missing values too. For Tutorial 2, I am planning to use a dataset which has null values, so that I can demonstrate how to handle this problem.
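
As a small preview of that (just a sketch for now), this is how you could count the missing values in each column and fill them with a default value:

df.isnull().sum() # number of missing values per column (all zeros for this dataset)
# df.fillna(0)    # would replace any missing values with 0, if there were any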

Step 4 – Filtering the data in a DataFrame

Filtering is a term which is often used in Data Science. It simply means breaking the DataFrame down into subsets of rows that satisfy a particular condition.
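
In Pandas, this is usually done by writing a boolean condition inside square brackets. For example (using the energy column from our dataset and an arbitrary threshold of 0.8):

high_energy = df[df['energy'] > 0.8] # keeps only the rows where the condition is True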

In this dataset, we can use attributes such as danceability, duration etc. to find the Top K songs that match a particular criterion, as follows:

Top 10 "Least Danceable" songs:

We can sort the rows on the basis of the danceability column (ascending order by default), take the first 10 rows, and extract only the song_title column. This gives us the song_title for the Top 10 "Least Danceable" songs.

df.sort_values('danceability')['song_title'].head(10) 
Top 10 "Least Danceable" Songs

Suppose we want to view the artist names as well, since the song titles might not be unique. We can do this by adding artist to the columns we want to see, as follows:

df.sort_values('danceability')[['song_title', 'artist']].head(10) 
Top 10 "Least Danceable" Song Names with Artist Names

Top 10 longest songs:

We can sort the rows on the basis of the duration_ms column, take the last 10 rows (since the values are arranged in ascending order by default), and extract the song_title, artist and duration_ms columns as follows:

df.sort_values('duration_ms')[['song_title', 'artist', 'duration_ms']].tail(10)
Songs with longest duration with their duration and artist names

What if we want to do a more complex query?

For songs with a duration of less than 400 seconds, find the Top 10 "Most Energetic" songs:

Here we will have to apply two filters.

The first filter: finding songs with a duration of less than 400 seconds.

If we apply the first filter, we can observe that 1925 songs have a duration of less than 400000 ms (i.e. 400 s). So out of the 2017 songs, we only need to look at these 1925 songs for our next filter.

df_filt = df[df['duration_ms'] < 400000] #Our first filter in action

Now we will apply our second filter. We will arrange the songs from the filtered DataFrame (df_filt) in ascending order of energy and extract the last 10 rows. Note that we also keep the energy column in the result, so that we can sort by it again in a moment.

df_filt2 = df_filt.sort_values('energy')[['song_title', 'artist', 'energy', 'duration_ms']].tail(10)
Top 10 "Most Energetic" Songs having duration less than 400000 ms (400 s)

We have got the result we wanted! But we observe that these values are in ascending order. Since we asked for the Top 10 values, let’s convert them into descending order by setting ascending=False in the sort_values() function, for better readability. And, just like that, our job is done!

ans = df_filt2.sort_values('energy', ascending=False)
Final answer

We could have also arranged the rows in descending order while applying the second filter, which would have been more concise and efficient. But I wanted to explain the concepts step by step, and chose to do it this way.
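
For completeness, here is roughly what that more direct version would look like as a single chained expression (a sketch; it should produce the same ten songs):

ans = df[df['duration_ms'] < 400000].sort_values('energy', ascending=False)[['song_title', 'artist', 'energy', 'duration_ms']].head(10)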

Step 5 – Exporting data as CSV

Suppose we do some data manipulation on a DataFrame and get the results we expected. If we want to export them as a CSV file, to be used somewhere else or to be stored as a back-up, we can do the following:

df_filtered = df.sort_values('duration_ms')[['song_title', 'artist', 'duration_ms']].tail(10)
df_filtered.to_csv(r'<File Path>filtered_data.csv')
Exporting CSV to local
CSV downloaded to local
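
By default, to_csv also writes the DataFrame’s index as an extra first column in the file. If you don’t want it in the output, you can pass index=False:

df_filtered.to_csv(r'<File Path>filtered_data.csv', index=False)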

We have barely scratched the surface of all the amazing stuff we can do using this library. There’s a saying that "practice makes perfect", so I urge you to do the following activities before calling it a day! I know you have got it in you! Don’t give up yet!

Activities to sharpen skills

If you require additional practice, you can try doing the following activities:

  • Find songs with a value for the energy column greater than 0.4 and display the values of the song_title column.
  • For songs with a duration_ms value of less than 300000 ms (300 s), find the Top 10 "fastest" songs (songs with the highest values for the tempo column) and display the song_title and artist columns.
  • Find songs where the artist column equals Drake and the energy column is greater than 0.5, and display the values of the song_title column (this one needs two conditions at once; see the sketch after this list).
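
We haven’t combined conditions yet: in Pandas you do it with & (and) or | (or), wrapping each condition in parentheses. Here is a generic sketch with made-up column names and values, which you can adapt for the last activity:

subset = df[(df['some_column'] == 'some value') & (df['another_column'] > 0.5)]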
Photo by 蔡 世宏 on Unsplash

Wrapping Up

I know you might be tired and sleepy right now, like the panda above. But I have some good news: we are done with Part 1 of the tutorial. I plan to post one more tutorial on Pandas, in which I will show how to merge multiple datasets and perform Information Visualization using Matplotlib, which is yet another fun-filled Python library. I hope you had as much fun reading this as I had writing it!

Until next time friends!

