Introduction to Pandas – A Python Library for Data Manipulation and Analysis
Hands-On Tutorial – Part 1
This tutorial is for beginners in Data Science and/or Python programming. It focuses on Exploratory Data Analysis, Data Manipulation and Data Filtering using the Pandas library in Python.

If you are a Python developer who works regularly with data, I am sure you have heard of, and perhaps even used, the Pandas library. This is a hands-on tutorial for beginners in the field of Data Science and/or Python programming. I strongly believe in "Learning by Doing", and hence I feel a hands-on tutorial is the best way of learning something. I will not bore you with too much theory, and will make sure this tutorial is to the point. My goal is to pave the way for you to explore the inexhaustible world of Data Science.
So, let me start with a brief introduction to Pandas. Pandas is a library that allows you to work with tabular data, time-series data, matrix data and so on. A few great examples of how Pandas makes our lives easier are:
- Importing data from Comma-Separated Values (CSV) files, JSON files etc.
- Filling in missing values in the dataset
- Data Wrangling and Manipulation
- Merging multiple datasets into one dataset
- Exporting our results as Comma-Separated Values (CSV) files, JSON files etc.
The possibilities are endless when it comes to what this library has to offer. So without further ado, let’s begin the first tutorial. I have broken down this tutorial into easy-to-follow steps, and at the end I have listed some activities that you can do at your convenience to sharpen your skills.
Step 1 – Install Pandas
First and foremost, let’s start with how to install the Pandas library. I am using a Windows 10 PC for this tutorial, but you are free to use any operating system of your choice (macOS, Linux and so on). There are many ways to install Pandas. I will explain how I did it personally, but if you would like to do it some other way, here is the complete set of instructions to install Pandas.
The easiest way to install Pandas is to get it through the Anaconda distribution, which is the most popular Python distribution for scientific computing. The Pandas library comes as part of the distribution. I currently have Python version 3.8.5, but it should work for Python versions 3.5.3 and higher.
Let’s confirm my claim that the Pandas library comes as part of the Anaconda distribution. If your Anaconda installation was successful, there should be an application called Anaconda Navigator installed on your PC. On the splash page of Anaconda Navigator, there should be an application called "CMD.exe Prompt". Hit "Launch".

Once you hit "Launch", a terminal will open up. This terminal can be used to install/uninstall libraries such as Pandas.

Once the terminal shows up, type one of the following commands and hit "Enter".
conda install pandas
pip install pandas

You should see that Pandas is already installed!
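Either way, a quick sanity check from within Python confirms the install (the exact version number you see will vary):

```python
import pandas as pd

# If this prints a version number, Pandas is installed and importable
print(pd.__version__)
```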
Anaconda comes with integrated development environments (IDEs) such as Jupyter Notebook already installed. I will be using Jupyter Notebook for my tutorials, but you are free to choose any IDE that you are comfortable with.
Step 2 – Set up Jupyter Notebook
Now we are ready to dive deep into the world of Pandas. But first, I want to show you how to set up the Jupyter Notebook IDE. If you plan to use Jupyter Notebook for this tutorial, follow along. If not, you can skip to Step 3.
Open the "CMD.exe prompt", as shown in Step 1. Then type in the following command and hit "Enter".
jupyter notebook

Alternatively, you can search for the Jupyter Notebook application in the Anaconda Navigator launcher and start it from there.

A page will open up on your browser as follows:

Hit "New" and Select "Python 3" under "Notebook"

And Voila! You should have your own Jupyter Notebook up and running!
First things first! Let’s rename the Notebook to something like "Pandas_Tutorial". You are free to name it anything you want.

Now we can start coding! Hooray!
Step 3 – Importing data from CSV files in Pandas
If you made it this far, good things are soon to come my friend! Hold tight and get ready to launch!
Let’s start with a very important step: Importing pandas in our Jupyter Notebook.
We can import pandas by typing
import pandas as pd
Using an alias for Pandas such as pd saves typing in the later steps, although it’s completely optional.

Now let us download a CSV file that we will be using for this tutorial. I will be using the dataset for Spotify Music Classification. You can download it by signing up/signing in on Kaggle. The reason I chose this dataset is that it has many "features" or "attributes" describing the songs, which we can use to gain some interesting insights about them. This process is called Exploratory Data Analysis, in which we try to find meaningful information about the dataset, such as:
- Finding Top K "Most Energetic/Least Energetic" songs
- Finding Top K songs which have the lowest/highest duration
- Finding Top K "Danceable/Least Danceable" songs
and so on.
We can import this dataset using the following code:
df = pd.read_csv(r"<File Path>data.csv")
The r before the file path is not a typo; it is a prefix that denotes a raw string. We might get a "FileNotFoundError" on a Windows machine if we don’t specify it.
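To see why the raw string matters, compare how Python treats backslashes in a hypothetical Windows path (the path below is purely illustrative):

```python
# A raw string keeps every backslash literal...
raw_path = r"C:\temp\new_data.csv"
# ...while in a normal string \t becomes a tab and \n becomes a newline
escaped_path = "C:\temp\new_data.csv"

print(raw_path)                            # C:\temp\new_data.csv
print(len(raw_path) == len(escaped_path))  # False: the escapes shortened the string
```

Because of this, a path written without the r prefix can silently point at a file that does not exist, producing the "FileNotFoundError" mentioned above.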
Pandas uses a Data Structure called a DataFrame to hold this data. A DataFrame allows us to store "relational" or "tabular" data, and gives us many in-built functions in order to manipulate the data.
We can just write the name of the DataFrame to view it as follows:
df

But often we don’t want to view the entire dataset; we just want a sneak peek of it. Here’s how you can do it.
df.head() #By default it shows the first 5 rows
We can also specify the number of rows in the head() function as follows:
df.head(10) #It will show the first 10 rows

And what if we want to view the last few rows?
df.tail() #By default it shows the last 5 rows
We can also specify the number of rows in the tail() function as follows:
df.tail(10) #It will show the last 10 rows
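If you want to experiment with head() and tail() without the Spotify file, a tiny stand-in DataFrame behaves the same way (the column names and values below are made up for illustration):

```python
import pandas as pd

# A small stand-in DataFrame with 20 rows
df_demo = pd.DataFrame({
    "song_title": [f"song_{i}" for i in range(20)],
    "energy": [i / 20 for i in range(20)],
})

print(df_demo.head())    # first 5 rows by default
print(df_demo.tail(3))   # last 3 rows
```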

What if we don’t want to view the data itself, but a quick overview of the entire DataFrame instead, such as the data types of the columns, the number of Null/Non-Null values in each column and so on? We can use the df.info() function as follows:
df.info()

From the above screenshot, we can observe that there are 16 columns and 2017 data points in the DataFrame. We can also see that there are no Null values, which shows that the data was well maintained, as well as the data types of all the columns.
If we are only interested in the data types of the columns, we can check them using df.dtypes
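On a small made-up DataFrame, the output of df.dtypes looks like this (one entry per column; the real Spotify frame will show all 16):

```python
import pandas as pd

# Toy frame with one text column, one float column and one integer column
df_demo = pd.DataFrame({
    "song_title": ["a", "b"],
    "energy": [0.1, 0.9],
    "mode": [1, 0],
})

# song_title → object, energy → float64, mode → int64
print(df_demo.dtypes)
```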

In this manner, we can view the data and extract meaningful information about the data. If this dataset had Null values, we would have to deal with those missing values too. For Tutorial 2, I am planning to use a Dataset which has Null values, so that I can demonstrate how to solve this problem.
Step 4 – Filtering the data in a DataFrame
Filtering is a term that is often used in Data Science. It basically means breaking the DataFrame down into subsets that satisfy a particular condition.
In this dataset, we can use attributes such as danceability, duration etc. to filter out the Top K songs that match a particular criterion, as follows:
Top 10 "Least Danceable" songs:
We can sort the rows by the danceability column (ascending order by default), take the first 10 rows and extract data only for the song_title column. This will give us the song_title for the Top 10 "Least Danceable" songs.
df.sort_values('danceability')['song_title'].head(10)

Suppose we want to view artist names as well, since the song titles might not be unique. We can do this by adding artist to the columns we want to see, as follows:
df.sort_values('danceability')[['song_title', 'artist']].head(10)
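The same sort-then-select pattern can be tried on any DataFrame; here is a sketch on made-up values (the titles, artists and scores below are placeholders, not from the Spotify dataset):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "song_title": ["t1", "t2", "t3"],
    "artist": ["a1", "a2", "a3"],
    "danceability": [0.9, 0.1, 0.5],
})

# Sort ascending by danceability, then keep only the two columns we care about
least_danceable = df_demo.sort_values("danceability")[["song_title", "artist"]].head(2)
print(least_danceable)
```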

Top 10 longest songs:
We can sort the rows by the duration_ms column, take the last 10 rows (since the values are arranged in ascending order by default) and extract data for the song_title, artist and duration_ms columns as follows:
df.sort_values('duration_ms')[['song_title', 'artist', 'duration_ms']].tail(10)

What if we want to do a more complex query?
For songs with a duration of less than 400 seconds, find the Top 10 "Most Energetic" songs:
Here we will have to apply two filters.
The first filter: finding songs with a duration of less than 400 seconds.
If we apply this filter, we can observe that 1925 songs have a duration of less than 400000 ms (i.e. 400 s). So out of the 2017 songs, we only need to look at these 1925 songs for our next filter.
df_filt = df[df['duration_ms'] < 400000] #Our first filter in action
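Under the hood, a filter like this builds a True/False mask (one value per row) and keeps only the rows where the mask is True; here is a minimal sketch on made-up durations:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "song_title": ["s1", "s2", "s3"],
    "duration_ms": [500000, 350000, 200000],
})

mask = df_demo["duration_ms"] < 400000   # Series of True/False, one per row
short_songs = df_demo[mask]              # keeps only the rows where the mask is True
print(short_songs)
```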

Now we will apply our second filter. We will arrange these songs from the filtered DataFrame (df_filt) in ascending order of energy, and extract last 10 values.
df_filt2 = df_filt.sort_values('energy')[['song_title', 'artist', 'duration_ms', 'energy']].tail(10) #Keep the 'energy' column so we can sort by it again

We have got the result we wanted! But we now observe that these values are in ascending order. Since we asked for the Top 10 values, let’s convert them to descending order by setting ascending=False in the sort_values() function, for better readability. And, just like that, our job is done!
ans = df_filt2.sort_values('energy', ascending=False)

We could also have arranged the rows in descending order while applying the second filter, which would have been more efficient. But I wanted to explain the concepts step by step, and chose to do it this way.
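For reference, that more efficient one-step version could look like the chained expression below, sketched here on a small made-up frame (the real query would run on df with head(10)):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "song_title": ["s1", "s2", "s3", "s4"],
    "artist": ["a1", "a2", "a3", "a4"],
    "duration_ms": [450000, 300000, 200000, 350000],
    "energy": [0.9, 0.4, 0.8, 0.6],
})

# Filter, sort descending by energy, select columns and take the top rows in one chain
ans = (df_demo[df_demo["duration_ms"] < 400000]
       .sort_values("energy", ascending=False)
       [["song_title", "artist", "energy"]]
       .head(2))
print(ans)
```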
Step 5 – Exporting data as CSV
Suppose we do some data manipulation and get the expected results in a DataFrame. If we want to export it as a CSV, to be used somewhere else or stored as a backup, we can do the following:
df_filtered = df.sort_values('duration_ms')[['song_title', 'artist', 'duration_ms']].tail(10)
df_filtered.to_csv(r'<File Path>filtered_data.csv')
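One detail worth knowing: to_csv() writes the DataFrame index as an extra column by default, and you can pass index=False to drop it. A quick round trip through a temporary file shows the behaviour (the file name here is arbitrary):

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({"song_title": ["s1", "s2"], "energy": [0.5, 0.9]})

tmp_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
df_demo.to_csv(tmp_path, index=False)   # index=False skips the row-number column

df_back = pd.read_csv(tmp_path)
print(df_back.equals(df_demo))  # True: same columns, same values
```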


We have barely scratched the surface of all the amazing things we can do using this library. There’s a saying that "practice makes perfect", so I urge you to do the following activities before calling it a day! I know you have got it in you! Don’t give up yet!
Activities to sharpen skills
If you require additional practice, you can try doing the following activities:
- Find songs with a value for the energy column greater than 0.4 and display the values for the song_title column.
- For songs with a duration_ms value less than 300000 ms (300 s), find the Top 10 "fastest" songs (songs with the highest values for the tempo column) and display the values for the song_title and artist columns.
- Find songs with the value for the artist column equal to "Drake" and a value for the energy column greater than 0.5, and display the values for the song_title column.
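A hint for the last activity, since we have not yet combined two conditions in one filter: wrap each condition in parentheses and join them with &. Here is the pattern on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "song_title": ["s1", "s2", "s3"],
    "artist": ["Drake", "Drake", "Other"],
    "energy": [0.7, 0.3, 0.9],
})

# Each condition in parentheses, joined with & (element-wise "and")
mask = (df_demo["artist"] == "Drake") & (df_demo["energy"] > 0.5)
print(df_demo[mask]["song_title"])
```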

Wrapping Up
I know you might be tired and sleepy right now, like the panda above. But I have some good news: we are done with Part 1 of the tutorial. I plan to post one more tutorial on Pandas, in which I will introduce how to merge multiple datasets, and I will also perform information visualization using Matplotlib, which is yet another fun-filled Python library. I hope you had as much fun reading this as I had writing it!
Until next time friends!