
In Part I, we went over writing clean code and spotting common silent failures. In Part II, we’re working on speeding up your runtime and lowering your memory footprint.
I also made a Jupyter notebook with the whole lesson, both parts included.
This is some code setup from the previous part, but I’m including it here so that this part can stand on its own.
# for neatness, it helps to keep all of your imports up top
import sys
import traceback
import numba
import numpy as np
import pandas as pd
import numpy.random as nr
import matplotlib.pyplot as plt
%matplotlib inline

# generate some fake data to play with
data = {
    "day_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] * 1000,
    "booleans": [True, False] * 3500,
    "positive_ints": nr.randint(0, 100, size=7000),
    "mixed_ints": nr.randint(-100, 100, size=7000),
    "lat1": nr.randn(7000) * 30,
    "lon1": nr.randn(7000) * 30,
    "lat2": nr.randn(7000) * 30,
    "lon2": nr.randn(7000) * 30,
}
df_large = pd.DataFrame(data)
# let's copy this proactively; copy vs. view is covered in Part I
large_copy = df_large.copy()
df_large.head()

Rev up your Pandas
Now that you have good coding habits, it’s time to work on performance. There’s a range of techniques you can use, from vectorization to just-in-time compilation, to get your code running faster. To measure bottlenecks and quantify the gains, let’s introduce timeit, a handy performance-measurement tool built into Jupyter notebooks. All you need to know is that putting %timeit before a single line of code measures the runtime of that line, while putting %%timeit at the top of a cell measures the runtime of the whole cell.
For this example, I chose the haversine function: it looks somewhat complicated but is actually quite amenable to optimization. I found this implementation on Stack Overflow.
def haversine(lat1, lon1, lat2, lon2):
    """Haversine calculates the distance between two points on a sphere."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return c
%timeit haversine(100, -100, 50, -50)
> 14.3 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Three different ways of iterating through rows
You probably know that Pandas is built on top of NumPy so it can take advantage of optimizations in the underlying C code while keeping an accessible Python interface. However, like anything with a foot in two worlds, you need to know where the border is to take proper advantage of that dual nature. Depending on how you iterate through a DataFrame, you could either be taking full advantage of those optimizations or completely bypassing them.
If you are completely new to Pandas and are just thinking of DataFrames as a wrapper for nested Python lists, then your first instinct may be to iterate through rows one at a time, such as by using df.iterrows().
%%timeit
# `iterrows` is a generator that yields indices and rows
dists = []
for i, r in large_copy.iterrows():
    dists.append(haversine(r["lat1"], r["lon1"], r["lat2"], r["lon2"]))
large_copy["spherical_dist"] = dists
> 736 ms ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At roughly 740 ms per loop, that’s three-quarters of a second for a single pass over 7,000 rows, and it adds up fast: repeat an operation like this 100,000 times and you’re looking at roughly 20 hours.
A somewhat more optimized option is the df.apply() method, which works on both DataFrame and Series objects. You define a custom function and pass it in, and in some cases Pandas can infer a faster way to apply it across the DataFrame’s rows or columns.
large_copy["spherical_dist"] = large_copy.apply(
lambda r: haversine(r["lat1"], r["lon1"], r["lat2"], r["lon2"]), axis=1
)
> 387 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is an improvement, but it’s not amazing. What you need to do is use vectorization, which to put in absolutely simplistic terms, is to feed into a Numpy-compatible function a whole Series of values that would normally take one value at a time so that the C code is free to divide up the vector internally and do parallel processing. Numpy functions can usually take a single value, a vector, or a matrix. If the internal processing can act on both a single value or a vector of values at every step, you can use vectorization out of the box.
%%timeit
large_copy["spherical_dist"] = haversine(
    large_copy["lat1"],
    large_copy["lon1"],
    large_copy["lat2"],
    large_copy["lon2"],
)
> 2.17 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That’s roughly a 340X speedup over iterrows (and still nearly 180X over apply): the hypothetical 20 hours shrinks down to about 4 minutes. Note that I haven’t changed the original function in any way. If you go back and look at the code, you can trace for yourself how each value can be either a single number or a whole Series.
Some rules of thumb for iterating
Like with indexing, Pandas is flexible about how you go through the values of each row. Here are some rules of thumb:
- If you want to apply the same transformation to every value in a column, use vectorization.
- If you need conditional vectorization, use boolean indexing (see the short sketch after this list).
  - This also works on strings, e.g. Series.str.replace("remove_word", "").
- Only use apply for functions that can’t be broadcast, e.g. pd.to_datetime().
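To make the boolean-indexing rule concrete, here’s a minimal sketch. It works on a throwaway copy of large_copy, and the mask and transformations are made up purely for illustration:
# conditional vectorization: build a boolean mask, then assign only where it's True
demo = large_copy.copy()
in_northern_hemisphere = demo["lat1"] > 0  # one boolean per row
demo.loc[in_northern_hemisphere, "lat1"] = demo.loc[in_northern_hemisphere, "lat1"].round(2)
# the string rule works the same way, across the whole column at once
demo["day_of_week"] = demo["day_of_week"].str.replace("day", "")
demo.head()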
Just-in-time compilation with Numba
What if you can’t vectorize? Does that mean you’re stuck with df.apply()? Not necessarily. If your code can be expressed as a combination of pure Python and Numpy arrays, give Numba a try and see whether it can speed things up for you. Writing Numba code is nothing like writing Cython, which can feel like learning a whole new programming language if you only know Python. As long as your code sticks to pure Python and Numpy, it’s literally just a matter of putting a decorator on top of your existing functions.
This example is based on some functions that calculate whether a given complex number belongs to the Mandelbrot set and visualize the resulting fractal by trying every pixel coordinate. It was taken from the Numba docs.
def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    i = 0
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i
    return 255

def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color
    return image
image = pd.DataFrame()
image["pixels"] = np.zeros(1, dtype=np.uint8)
%timeit create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)
> 30.5 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here are the exact same functions but with the Numba decorators on top. I had to make the input data very, very small because timeit runs many iterations to measure the speed.
@numba.jit
def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    i = 0
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i
    return 255

@numba.jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color
    return image
image = pd.DataFrame()
image["pixels"] = np.zeros(1, dtype=np.uint8)
%timeit create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)
> 1.63 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s 18X the speed for about 5 seconds of typing. If your function requires some intricate for loops and conditions mixed together, it’s still worth taking a look to see if you can apply some Numba magic.
In case you were curious, this is what the output of create_fractal looks like when you apply it to a real input of all zeroes:
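If you want to reproduce it, something along these lines should work (the resolution and colormap here are my own arbitrary choices, not from the original):
# render the fractal into a 2D uint8 array of zeroes and display it
fractal_image = np.zeros((1000, 1500), dtype=np.uint8)
create_fractal(-2.0, 1.0, -1.0, 1.0, fractal_image, 20)
plt.imshow(fractal_image, cmap="viridis")
plt.axis("off")
plt.show()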

Manage your memory
If you’ve used Pandas for a while, you’ve probably had a medium-large DataFrame crash your Python process at some point. What’s puzzling is when the DataFrame doesn’t really seem that large, or maybe you remember processing something even bigger last week on the same laptop, but that one didn’t crash. What gives?
The biggest offenders for memory problems in Pandas are probably:
- You still have references attached to variables, which means they don’t get garbage collected.
- You have too many copies of DataFrames lying around.
- You could stand to do more in-place operations, which don’t produce copies of your DataFrame (see the short sketch after this list).
- object dtypes take up a lot more memory than fixed dtypes.
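To make the copies and in-place points concrete, here’s a minimal sketch on a throwaway copy of large_copy (the column choice is arbitrary, and whether an operation truly avoids an internal copy depends on the operation):
# dropping a column two ways: one keeps an extra DataFrame around, one modifies in place
demo = large_copy.copy()
dropped = demo.drop(columns=["booleans"])      # returns a new DataFrame; `demo` is still in memory too
demo.drop(columns=["booleans"], inplace=True)  # modifies `demo` itself and returns None
del dropped, demo                              # release the references once you're done with them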
Garbage collection
Garbage collection is the process by which Python frees up memory that is no longer useful to the program. You release an object by removing every reference to it; once nothing refers to it anymore, the memory it occupies can be reclaimed.
The best way to let garbage collection help you manage memory is to wrap whatever you can into functions. Variables declared in functions are scoped to the function, so when the function finishes running, they get discarded. On the other hand, global variables (like large_copy) are kept around until the Python process ends (i.e. the notebook kernel is shut down). Even if you del a variable, that only decreases its reference count by 1, and if the count doesn’t hit 0, the referenced object isn’t actually deleted. That’s why global variables can screw up what you think your memory is holding onto.
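Here’s a minimal sketch of that pattern, reusing the haversine function from earlier (the function name is just for illustration). The intermediate Series only exists while the function is running, so it doesn’t linger in the global scope:
# intermediates created inside a function are discarded when the function returns
def mean_spherical_dist(df):
    dists = haversine(df["lat1"], df["lon1"], df["lat2"], df["lon2"])  # local, temporary
    return dists.mean()

result = mean_spherical_dist(large_copy)  # only the small scalar result stays around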
Just for fun, you can peek at a variable’s reference count by using sys.getrefcount(var_name).
# `foo` is a reference
foo = []
sys.getrefcount(foo) # this temporarily bumps it up to 2
> 2
# a reference from inside the list itself bumps it up again
foo.append(foo)
sys.getrefcount(foo)
> 3
At this point, del won’t be able to free the list right away: it only removes one of the references, and the remaining references still keep the object alive.
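To see the same idea with del, here’s a small sketch continuing the example above (the name bar is made up):
bar = foo             # yet another reference to the same list
del foo               # removes the global name `foo`, which only decrements the count by 1
sys.getrefcount(bar)  # still above 1: `bar` and the list's own self-reference keep it alive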
object dtypes take up a lot of memory
It’s those pesky object dtypes again! Not surprisingly, telling Pandas that a column needs to be able to hold literally anything at any time means it has to store full Python objects (and pointers to them) rather than compact fixed-width values, which takes far more memory. This is fine if you’re actually storing something complex, but if your values could easily be represented more simply, it’s worth seeing whether you can switch the column to a dtype better suited to your situation.
It’s actually easier than you may think to inspect how much memory your DataFrames are occupying. It’s already built into the DataFrame object.
large_copy.info(memory_usage="deep")
>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 8 columns):
day_of_week 7000 non-null object
booleans 7000 non-null bool
positive_ints 7000 non-null int64
mixed_ints 7000 non-null int64
lat1 7000 non-null float64
lon1 7000 non-null float64
lat2 7000 non-null float64
lon2 7000 non-null float64
dtypes: bool(1), float64(4), int64(2), object(1)
memory usage: 773.5 KB
A common practice when conserving memory is downcasting. For example, if you know your integers don’t need 64-bits, cast them down to 32-bits. But, as we’ll see, not all downcasting is equally impactful.
large_copy["positive_ints"] = large_copy["positive_ints"].astype(np.int32)
large_copy.info(memory_usage="deep")
>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 8 columns):
day_of_week 7000 non-null object
booleans 7000 non-null bool
positive_ints 7000 non-null int32
mixed_ints 7000 non-null int64
lat1 7000 non-null float64
lon1 7000 non-null float64
lat2 7000 non-null float64
lon2 7000 non-null float64
dtypes: bool(1), float64(4), int32(1), int64(1), object(1)
memory usage: 746.2 KB
A 3% reduction isn’t really remarkable. If you have string columns and they’re being stored as an object type, they’ll always overshadow floats and integers in memory consumption.
All str values are stored as object in Pandas because they can be any length. You can downcast string columns to a fixed-length type; for example, this one limits each entry to 10 characters:
large_copy["day_of_week"] = large_copy["day_of_week"].astype("|S10")
large_copy.info(memory_usage="deep")
>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 8 columns):
day_of_week 7000 non-null object
booleans 7000 non-null bool
positive_ints 7000 non-null int32
mixed_ints 7000 non-null int64
lat1 7000 non-null float64
lon1 7000 non-null float64
lat2 7000 non-null float64
lon2 7000 non-null float64
dtypes: bool(1), float64(4), int32(1), int64(1), object(1)
memory usage: 636.8 KB
An 18% reduction isn’t bad, but is there more we can do? You happen to know that day_of_week should only ever contain 7 distinct values. There’s a category dtype that can take advantage of columns composed of a small set of repeating elements.
large_copy["day_of_week"] = large_copy["day_of_week"].astype("category")
large_copy.info(memory_usage="deep")
>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 8 columns):
day_of_week 7000 non-null category
booleans 7000 non-null bool
positive_ints 7000 non-null int32
mixed_ints 7000 non-null int64
lat1 7000 non-null float64
lon1 7000 non-null float64
lat2 7000 non-null float64
lon2 7000 non-null float64
dtypes: bool(1), category(1), float64(4), int32(1), int64(1)
memory usage: 315.2 KB
That’s a 59% reduction, just from one column alone!
Conclusion, Part II
I’m almost ashamed to admit it took me years of using Pandas before I started to look up optimizations, and even today, it’s not like I go around optimizing every single notebook I make. However, with a few simple tricks, you can see how the difference of a couple of lines of code or even just how you handle a single column could result in speedups of 10–100X and memory decreases of greater than 50%. It’s definitely a good set of tricks to have up your sleeves.
If you missed it, check out Part I: Writing good code and spotting silent failures.