I’ve been using pandas as my main tool for data analysis for the last 3 years. I must admit that most of these "how NOT to code with pandas" examples come from my own beginnings. While doing code reviews, I still see many of these "how NOT to-s" from more experienced programmers.
In this article, I first show an example of "how NOT to" and then show a proper "how TO" way of calculating statistics with pandas.
The arguments for the improvements are more concise, more readable code and faster execution. Reported times are in the format 831 ms ± 25.7 ms per loop, which means 831 milliseconds on average with a standard deviation of 25.7 milliseconds. Each code sample is executed multiple times to measure execution time accurately.
Sample Dataset
The sample dataset contains booking information for various cities. The data is randomly generated; its only purpose is to illustrate the examples.
The dataset has 3 columns:
- id is a unique id,
- city is the city of booking,
- booked_perc is the percentage of bookings at a certain time.
The dataset has 10K entries to make the speed improvements more noticeable. Pandas can calculate statistics on DataFrames with millions of rows if the code is written in the proper pandas way.
import pandas as pd
import numpy as np
size = 10000
cities = ["paris", "barcelona", "berlin", "new york"]
df = pd.DataFrame(
    {"city": np.random.choice(cities, size=size), "booked_perc": np.random.rand(size)}
)
df["id"] = df.index.map(str) + "-" + df.city
df = df[["id", "city", "booked_perc"]]
df.head()

1. How NOT to sum the data

I came from the Java world and brought multi-line for loops with me to Python.
Let’s calculate the sum of the booked_perc column – I know it doesn’t make sense to sum percentages, but let’s do it anyway 🙂
%%timeit
suma = 0
for _, row in df.iterrows():
    suma += row.booked_perc
766 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The more Pythonic way to sum the values of a column is:
%%timeit
sum(booked_perc for booked_perc in df.booked_perc)
989 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.booked_perc.sum()
92 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As expected, the first example is the slowest – it takes almost 1 second to sum 10k entries. I was surprised by how fast the second example is. The proper way of summing the data with pandas (or applying any other operation to a column) is the third example – which is also the fastest!
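As an aside, if every microsecond matters, you can sometimes go one step further and sum the underlying NumPy array directly, skipping some of the pandas Series overhead. A minimal sketch – I haven’t timed it here, so treat the extra speedup as an assumption rather than a measurement:

# sum the raw NumPy array behind the column, bypassing some pandas overhead
df.booked_perc.to_numpy().sum()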
2. How NOT to filter the data

Despite being quite experienced with NumPy before I started using pandas, I was filtering data in a for loop. You can observe the performance hit while calculating the sum.
%%timeit
suma = 0
for _, row in df.iterrows():
    if row.booked_perc <= 0.5:
        suma += row.booked_perc
831 ms ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df[df.booked_perc <= 0.5].booked_perc.sum()
724 µs ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As expected, the second example is a few orders of magnitude faster than the first one.
What about adding more filters? We simply combine them with &, wrapping each condition in parentheses:
%%timeit
df[(df.booked_perc <= 0.5) & (df.city == 'new york')].booked_perc.sum()
1.55 ms ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
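As an aside, pandas also offers the query method, which expresses the same filter as a string. It is often easier to read with multiple conditions, though I haven’t timed it against the example above, so treat any speed difference as an open question:

# the same filtered sum, written with query
df.query("booked_perc <= 0.5 and city == 'new york'").booked_perc.sum()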
3. How NOT to access previous values

You might say: OK, but what about when I need to access values from a previous row? Surely I need a for loop for that. Wrong!
Let’s calculate a percentage change from one row to another with and without a for loop.
%%timeit
for i in range(1, len(df)):
    df.loc[i, "perc_change"] = (
        df.loc[i].booked_perc - df.loc[i - 1].booked_perc
    ) / df.loc[i - 1].booked_perc
7.02 s ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df["perc_change"] = df.booked_perc.pct_change()
586 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Again, the second example is much faster than the first one with the for loop.
Pandas has many functions that can calculate a statistic based on previous values (e.g. the shift function, which lags a column by one or more rows). These functions take a periods argument that defines the number of previous rows to include in the calculation.
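For example, pct_change from above can be reproduced with shift and its periods argument. A small sketch – the perc_change_manual column name is just for illustration:

# compare each value with the one periods=1 rows earlier
prev = df.booked_perc.shift(periods=1)
df["perc_change_manual"] = (df.booked_perc - prev) / prev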
4. How NOT to apply complex functions

Sometimes we need to apply a complex function (a function with multiple variables) to a DataFrame. Let’s say we would like to multiply booked_perc by two for bookings from New York, set it to 0 for all other cities, and name the new column sales_factor.
The first approach that comes to my mind is using a for loop with iterrows.
%%timeit
for i, row in df.iterrows():
    if row.city == 'new york':
        df.loc[i, 'sales_factor'] = row.booked_perc * 2
    else:
        df.loc[i, 'sales_factor'] = 0
3.58 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A slightly better approach is using the apply function directly on the DataFrame.
%%timeit
def calculate_sales_factor(row):
    if row.city == 'new york':
        return row.booked_perc * 2
    return 0

df['sales_factor'] = df.apply(calculate_sales_factor, axis=1)
165 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The fastest way is to use a pandas filter and calculate the function values directly.
%%timeit
df.loc[df.city == 'new york', 'sales_factor'] = df[df.city == 'new york'].booked_perc * 2
df.sales_factor.fillna(0, inplace=True)
3.03 ms ± 85.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can observe the speedup from the first to the last example.
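An alternative worth knowing is NumPy’s where, which builds the whole column in a single vectorized expression and avoids the separate fillna step. A sketch, not timed against the examples above:

# booked_perc * 2 for New York rows, 0 everywhere else
df["sales_factor"] = np.where(df.city == "new york", df.booked_perc * 2, 0)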
When we are dealing with a function of three or more variables, we can break it down into multiple pandas expressions. This will be faster than using the apply function.
E.g. for f(x, a, b) = (a + b) * x:
df['a_plus_b'] = df['a'] + df['b']
df['f'] = df['a_plus_b'] * df['x']
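For comparison, the apply-based version of the same function would look like the sketch below. The columns a, b and x are hypothetical – they are not in our sample dataset – and, based on the timings above, I would expect this row-wise version to be noticeably slower:

# slower row-wise equivalent of f(x, a, b) = (a + b) * x
df["f"] = df.apply(lambda row: (row.a + row.b) * row.x, axis=1)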
5. How NOT to group data

As you’ve seen by now, I relied heavily on for loops when I started using pandas. Grouping data lets you use pandas to its full potential and reduces the number of lines of code.
Let’s say we would like to calculate:
- the average sales factor by city
- and the first booking id by city.
%%timeit
avg_by_city = {}
count_by_city = {}
first_booking_by_city = {}

for i, row in df.iterrows():
    city = row.city
    if city in avg_by_city:
        avg_by_city[city] += row.sales_factor
        count_by_city[city] += 1
    else:
        avg_by_city[city] = row.sales_factor
        count_by_city[city] = 1
        first_booking_by_city[city] = row['id']

for city, _ in avg_by_city.items():
    avg_by_city[city] /= count_by_city[city]
878 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pandas has a group by operation, so there is no need to iterate over the DataFrame. group by in pandas does the same thing as the GROUP BY statement in SQL.
%%timeit
df.groupby('city').sales_factor.mean()
df.groupby('city').sales_factor.count()
df.groupby('city').id.first()
3.05 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby("city").agg({"sales_factor": ["mean", "count"], "id": "first"})
4.5 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Surprisingly, the third example is not the fastest, but it is more concise than the second example. I would suggest you use the second approach when you need to speed up your code.
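If you like the single-agg style but want flat, descriptive column names, newer pandas versions (0.25 and later) also support named aggregation. A sketch – the output column names here are my own choice:

df.groupby("city").agg(
    avg_sales_factor=("sales_factor", "mean"),
    bookings=("sales_factor", "count"),
    first_booking=("id", "first"),
)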
Conclusion

My advice is:
If you are using for loops with pandas, there is most probably a better way to write it.
There are computationally expensive functions where even the optimizations above don’t help. Then we need the last resort: Cython and Numba, which I am going to cover in the following weeks.
Did you enjoy the post? Learned something new? Let me know in the comments below.
Before you go
As usual, you can download this Jupyter Notebook to try examples on your machine.
Follow me on Twitter, where I regularly tweet about Data Science and Machine Learning.
