An Introduction to Pandas in Python

Bruno Santos
Towards Data Science
8 min read · Aug 5, 2019


No, this blog post isn’t about the panda bear. Sorry to disappoint.

The readme in the official pandas github repository describes pandas as “a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal.”

The pandas library is a powerful Python data analysis toolkit.

Background on Pandas

According to Wikipedia, pandas is “a software library written for the Python programming language for data manipulation and analysis.” Sounds great, doesn’t it? But what does it really mean, and how is pandas applicable and useful for a data scientist? In this blog post, I will detail some of the uses for pandas, giving examples along the way. This is meant to be a brief introduction to the pandas library and its capabilities rather than an all-encompassing deep dive. For more detailed information, please see the pandas github repository here, or the official pandas documentation here.

Wes McKinney, the godfather of pandas.

Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR Capital Management. He was able to convince AQR to let him open source the library, which not only allows but encourages data scientists across the globe to use it for free, contribute to the official repository, file bug reports and fixes, improve the documentation, suggest enhancements, and provide ideas for improving the software.

Getting Started with Pandas

Once you’ve installed pandas, the first step is importing it. You can find more information on installing pandas here. Below is the commonly used shortcut for pandas. While you don’t need to import pandas using an alias, it helps to use the alias so you can use pd.command rather than typing out pandas.command every time you need to call a method or property.

import pandas as pd

Once you’ve imported the library, let’s import a dataset so we can begin to look at pandas and its functionality. We’ll use a dataset from the online version of “An Introduction to Statistical Learning with Applications in R”; the book’s datasets can be found here.

Pandas is able to read several different types of stored data, including CSV (comma separated values), TSV (tab separated values), JSON (JavaScript Object Notation), and HTML (Hypertext Markup Language), among others. For our example, we’ll use a CSV and read in our data that way.
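Each of these formats has its own reader function: pd.read_csv, pd.read_json, pd.read_html, and so on. As a small sketch (using made-up toy data, not the Auto dataset we load below), a TSV can be read with read_csv simply by switching the separator:

```python
import pandas as pd
from io import StringIO

# Toy tab-separated data, wrapped in StringIO so it behaves like a file.
tsv_data = StringIO("state\tcity\nNJ\tTowaco\nCA\tSan Francisco")

# read_csv handles TSVs too; we just tell it the separator is a tab.
df = pd.read_csv(tsv_data, sep='\t')
print(df.shape)  # (2, 2)
```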

auto = pd.read_csv('http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.csv')

In the above block of code, we’ve read in our CSV and assigned it to the variable auto. The link within the single quotes is our absolute file path, or where the file is actually stored. We can also pass in a relative file path if our file is stored locally, which locates the target file relative to our current working directory.
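To sketch the relative-path case, suppose we had saved a small sample of the data locally (the file name sample.csv here is hypothetical, just for illustration). read_csv resolves the path relative to the current working directory:

```python
import pandas as pd

# Write a tiny sample frame to a local CSV (hypothetical file name),
# then read it back with a relative path instead of a URL.
pd.DataFrame({'mpg': [18.0, 15.0], 'cylinders': [8, 8]}).to_csv('sample.csv', index=False)

sample = pd.read_csv('sample.csv')
print(sample.shape)  # (2, 2)
```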

The Difference Between a Series and a DataFrame

In pandas, we have two main data structures that we can explore. The first is a DataFrame and the second is a Series. So what’s the difference between the two?

A DataFrame is a two-dimensional array of values with both a row and a column index.

A Series is a one-dimensional array of values with an index.

If it’s not clear yet what the distinction is between the two based on those one-sentence explanations, hopefully the picture below helps.

A Series, on the left, and a DataFrame on the right. Image taken from General Assembly’s “Intro to Pandas” lesson.

If it looks like the picture on the left is also present in the picture on the right, you’re right! A DataFrame is the entire dataset, including all rows and columns; a Series is essentially a single column within that DataFrame. Creating these two data structures is a fairly straightforward process in pandas, as shown below.

pd.DataFrame(data=[
    ['NJ', 'Towaco', 'Square'],
    ['CA', 'San Francisco', 'Oval'],
    ['TX', 'Austin', 'Triangle'],
    ['MD', 'Baltimore', 'Square'],
    ['OH', 'Columbus', 'Hexagon'],
    ['IL', 'Chicago', 'Circle']
], columns=['State', 'City', 'Shape'])

pd.Series(data=['NJ', 'CA', 'TX', 'MD', 'OH', 'IL'])

Side note: inside brackets and parentheses, Python lets you break code across lines and indent it freely, hence the new lines and spacing to make the code easier to read.

The above code will create the DataFrame and Series that we see in our image just above. Passing in the data is necessary for creating DataFrames and Series so that they are not empty. There are additional parameters that can be passed into each of these; however, for our purposes we only need the ones passed. For more information on what these additional parameters do, please take a look at the documentation on DataFrames and Series.
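Since a Series is essentially a single column of a DataFrame, we can also get one the other way around: selecting a column from the DataFrame above with bracket notation returns a Series.

```python
import pandas as pd

# Recreate the DataFrame from above.
df = pd.DataFrame(data=[
    ['NJ', 'Towaco', 'Square'],
    ['CA', 'San Francisco', 'Oval'],
    ['TX', 'Austin', 'Triangle'],
    ['MD', 'Baltimore', 'Square'],
    ['OH', 'Columbus', 'Hexagon'],
    ['IL', 'Chicago', 'Circle']
], columns=['State', 'City', 'Shape'])

# Selecting one column returns a Series.
states = df['State']
print(type(df).__name__)      # DataFrame
print(type(states).__name__)  # Series
```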

Some Useful Pandas Methods and Properties for Exploratory Data Analysis

One of the first things any good data scientist should do is get a feel for the data. One of the best (and easiest) ways to accomplish this is to look at the first few rows. Calling the .head() method displays, by default, the first five rows of the DataFrame, including the column headers. You can modify how many rows are displayed by passing a number into the parentheses.

auto.head()
A look at the first five rows of our dataset.

Similarly, we can also take a look at the last few rows of our dataset. The .tail() method will, by default, show the last five rows of our DataFrame. Similar to .head(), we can pass a number into our method to modify how many rows are displayed. Let’s take a look at the last three rows of our DataFrame just to give you a feel for how we can modify .tail().

auto.tail(3)
A look at the last three rows of our dataset.

In addition to looking at the first or last few rows, it may be helpful to look at how many rows and columns are in your dataset. Let’s take a look at the .shape property.

auto.shape
The shape of our dataset, in rows and columns.

Here, our dataset is shown to have 397 rows and 9 columns. Based on our look at the .head() and .tail() of our dataset, we saw that our dataset started at index 0 and ran through index 396, confirming we have 397 total rows. But instead of just looking at how many rows, we may want to look at how many values within our dataset are missing. For this, we can call the .isnull() method.

The first five rows of our dataframe when .isnull() is called.

In the picture above, I captured the first 5 rows for the picture’s sake, but you can achieve this same limited set of data by calling .isnull().head(). One of the coolest things about pandas and Python is that they allow you to chain methods and properties. What .isnull() does is return a Boolean value for each entry, telling us whether it is missing (is null, or True) or not missing (not null, or False).
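Here is a small sketch of that chaining. Since our Auto dataset has no missing values, this uses a hypothetical frame with two gaps so that .isnull() has something to flag:

```python
import pandas as pd

# Toy frame with two deliberately missing values (None becomes NaN).
df = pd.DataFrame({'mpg': [18.0, None, 15.0],
                   'name': ['chevrolet', 'buick', None]})

# .isnull() returns a Boolean DataFrame; .head() then trims the rows.
print(df.isnull().head())
#      mpg   name
# 0  False  False
# 1   True  False
# 2  False   True
```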

Another way of looking at the null values is to stack another method onto our .isnull() method. If we look at .isnull().sum() we will see just how many null values there are in each of our columns.

auto.isnull().sum()
How many null values are in each column?

Luckily for us, none of the above columns have any null values — so we don’t have to worry about dropping null values, or filling null values.
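If we did have missing values, the two common fixes the sentence above alludes to look like this (a sketch on a hypothetical frame with one gap; the specific fill strategy, the column mean, is just one choice among many):

```python
import pandas as pd

# Hypothetical frame with one missing mpg value.
df = pd.DataFrame({'mpg': [18.0, None, 15.0]})

dropped = df.dropna()                 # drop any row containing a null
filled = df.fillna(df['mpg'].mean())  # or fill nulls with the column mean

print(len(dropped))             # 2
print(filled['mpg'].tolist())   # [18.0, 16.5, 15.0]
```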

Another useful bit of information is what type of data we have in our dataset. To see this, we can use the .dtypes property, which gives us a snapshot of the type of data contained in each column.

auto.dtypes
A look at our datatypes within our DataFrame

Here, we can see that we have floats (numbers with decimal places), integers (whole numbers), and objects (strings).
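As a quick sketch on toy data (not the Auto dataset), .dtypes reports one type per column, and a column can be converted with .astype if needed:

```python
import pandas as pd

# Toy frame mixing the three kinds of types mentioned above.
df = pd.DataFrame({'mpg': [18.0, 15.0],          # float
                   'cylinders': [8, 8],          # integer
                   'name': ['chevrolet', 'buick']})  # string/object

print(df.dtypes)

# Convert the integer column to float with .astype.
df['cylinders'] = df['cylinders'].astype('float64')
```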

If we want to look at a quick summary of all of the information we’ve looked at so far, one particularly useful method that we can call is .info() which essentially aggregates everything that we’ve explored so far!

auto.info()
.info() shows us just about everything we’ve covered so far.

Here, we can see what the type of our auto variable is (pandas.core.frame.DataFrame): it’s a DataFrame! The RangeIndex tells us how many total entries there are, from index 0 to index 396, and Data columns shows us how many total columns there are, in this case 9. Each individual column shows us how many entries there are, how many are non-null, and the type of that specific column. At the bottom, the dtypes line shows us which data types we have and how many of each. Finally, the memory usage line shows us how much memory our DataFrame actually uses.

Hopefully you’ve found this to be a quick introduction to pandas and some of its methods and properties, and it helps to show you why pandas can be so useful when looking at, and analyzing, datasets. While this was just a very basic intro, hopefully I’ll get the opportunity to do a deeper dive into pandas to clarify some things that were overlooked here. If you liked this blog post, please remember to check out the pandas documentation to see just how useful the library can be to you!
