How to Make Your Pandas Code Run Faster

Two Pandas Tricks I Wish I’d Known Earlier

Liad Pollak Zuckerman
Towards Data Science

--

Most of us data scientists, come to use Pandas library at some point of our work. This post is going to present 2 tricks that will make your pandas code run faster. The first is for removing None values, and the second regards extracting a set of values from a certain column.

Photo by Stone Wang on Unsplash. Text was added by the author.

Removing None Values

Like most things in life, our data isn’t perfect. Often we have None values which we want to remove prior to some calculation. Whether its a simple average, a histogram or a more complicated calculation, we want those None out. How can we do it with pandas? Piece of cake! We can simply use pd.dropna() method. Assuming our dataframe is called df and we want to remove all the None values in column “date”:

import pandas as pd#load dataframe
df = pd.read_pickle("my_dataframe.pkl")
#remove rows with None in column "date"
df.dropna(subset=['date'])

We’ll call this method “dropna”. It was my go-to method for a long time... Until I encountered a piece of code that looked like this:

import pandas as pd#load dataframe
df = pd.read_pickle("my_dataframe.pkl")
#remove rows with None in column "date"
df = df[df["date"].notna()]

My first reaction was: “why not simply use dropna()?”
Well, I checked it out and the answer is simple. It might not be pretty, but it’s much faster!
Let’s look at the following comparison:

A comparison between notna and dropna methods. Graphs created by the author.

I’ve tested the two methods over various dataframe sizes between 10 and 10⁷. On the left it shows the time to remove None values as a function of the dataframe length. Since it increases significantly with the number of rows, it’s more clear to look at the time ratio between the two methods — Therefore the right graph. It shows <notna time>/<dropna time>

The conclusion is clear: nonta wins every time.

What is the catch?

There isn’t. No memory consumption charges.

A comparison between notna and dropna RAM consumption. Graphs created by the author.

Extracting a set from a column

Sometimes we need to get a set of the items from a certain column. I’m going to compare Pandas’ method drop_duplicates with Numpy’s method unique. For this part we’ll first remove the None values in column “date”. Here’s the code for the two methods:

Pandas’ drop_duplicates:

import pandas as pd#load dataframe
df = pd.read_pickle("my_dataframe.pkl")
#extract "date" column and remove None values
date = df[df["date"].notna()]
#get a set of column "date" values
date_set = date.drop_duplicates(subset=['date'])['date'].values

Numpy’s unique:

import pandas as pd
import numpy as np
#load dataframe
df = pd.read_pickle("my_dataframe.pkl")
#extract "date" column and remove None values
date = df[df["date"].notna()]['date'].values
#get a set of column "date" values
date_set = np.unique(date)

So I performed a quick comparison and thought again we have a winner — Numpy’s unique. However, after carefully testing for various dataframe sizes, I realized that in reality — it’s more complicated.

A comparison between unique and drop_duplicates methods. Graphs created by the author.

For short dataframes unique was faster than drop_duplicates. As the dataframe got larger and larger drop_duplicates got the upper hand, where the equality point lies somewhere around 10⁶ rows (where the time ratio is exactly 1). I tried to see the effect of the number of columns and got the same results. Here again memory consumption was similar between the two methods for all sizes.

A comparison between unique and drop_duplicates RAM consumption. Graphs created by the author.

Conclusion

When we want to perform an operation, we each have our go-to method. Often there’s another implementation that is worth checking out, it might save you time and resources.
* For removing None values use notna.
* For extracting a set, it depends on your typical dataframe length: Under 10⁵ rows use unique. Over 10⁶ use drop_duplicates.

For other operations, check the alternatives and let me know what you’ve found out!

--

--