
How to Avoid a Pandas Pandemonium

A deep dive into common Pandas mistakes. Part I: Writing good code and spotting silent failures.


Photo by Kevin Ku on Unsplash

When you first start out using Pandas, it’s often best to just get your feet wet and deal with problems as they come up. Then, the years pass, the amazing things you’ve been able to build with it start to accumulate, but you have a vague inkling that you keep making the same kinds of mistakes and that your code is running really slowly for what seems like pretty simple operations. This is when it’s time to dig into the inner workings of Pandas and take your code to the next level. Like with any library, the best way to optimize your code is to understand what’s going on underneath the syntax.

First in Part I, we’re going to eat our vegetables and cover writing clean code and spotting common silent failures. Then in Part II, we’ll get to speeding up your runtime and lowering your memory footprint.

I also made a Jupyter notebook with the whole lesson, both parts included.

First, let’s make some fake data to play with. I’ll make a large DataFrame to illustrate larger processing problems and a small DataFrame to illustrate point changes in a localized way.

# for neatness, it helps to keep all of your imports up top
import sys
import traceback
import numba
import numpy as np
import pandas as pd
import numpy.random as nr
import matplotlib.pyplot as plt

%matplotlib inline
# generate some fake data to play with
data = {
    "day_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] * 1000,
    "booleans": [True, False] * 3500,
    "positive_ints": nr.randint(0, 100, size=7000),
    "mixed_ints": nr.randint(-100, 100, size=7000),
    "lat1": nr.randn(7000) * 30,
    "lon1": nr.randn(7000) * 30,
    "lat2": nr.randn(7000) * 30,
    "lon2": nr.randn(7000) * 30,
}

df_large = pd.DataFrame(data)
df_large.head()
small = {
    'a': [1, 1],
    'b': [2, 2]
}
df_small = pd.DataFrame(small)
df_small

Great, now we’re all set up!

Writing good code

Before we do the "cool" stuff like writing faster and more memory-optimized code, we need to build on a foundation of some fairly mundane-seeming coding best practices. These are the little things, such as naming things expressively and writing sanity checks, that will keep your code maintainable and readable by your peers.

Sanity checking is simple and totally worth it

Just because this is mostly data analysis, and it might not make sense to set up a whole suite of unit tests for it, doesn’t mean you can’t do any checks at all. Peppering your notebook code with assert statements can go a long way without much extra work.

Above, we made a DataFrame df_large whose columns follow some pre-defined rules. For example, you can check for data-entry errors by trimming whitespace and asserting that the number of unique values stays the same:

large_copy = df_large.copy()
assert large_copy["day_of_week"].str.strip().unique().size == large_copy["day_of_week"].unique().size

Nothing should happen if you run this code. However, modify it slightly to break it and…

large_copy.loc[0, "day_of_week"] = "Monday "
assert large_copy["day_of_week"].str.strip().unique().size == large_copy["day_of_week"].unique().size

Bam! AssertionError. Suppose you had some code that drops duplicates and ships the delta to a client. A stray whitespace addition would slip through deduplication and get shipped as its own unique data point. That’s not a big deal when it’s the days of the week and you can spot-check it easily, but if you have potentially thousands of unique strings or more, these kinds of checks can save you a big headache.
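Here’s a minimal sketch of that kind of guard, using the large_copy DataFrame from above: normalize the whitespace before deduplicating, then assert on the count you expect.

# normalize whitespace *before* deduplicating, then confirm the result
cleaned = large_copy.copy()
cleaned["day_of_week"] = cleaned["day_of_week"].str.strip()
deduped = cleaned.drop_duplicates("day_of_week")
# we built day_of_week from exactly 7 weekday names
assert len(deduped) == 7, f"expected 7 unique days, got {len(deduped)}"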

Use consistent indexing

Pandas grants you a lot of flexibility in indexing, but it can add up to a lot of confusion later if you’re not disciplined about keeping a consistent style. This is one proposed standard:

# for getting columns, use a string or a list of strings for multiple columns
# note: a one-column DataFrame and a Series are not the same thing
one_column_series = df_large["mixed_ints"]
two_column_df = df_large[["mixed_ints", "positive_ints"]]
# for getting a slice of rows and columns, use `loc`
# (a single column gives you a Series, a list of columns gives you a DataFrame)
data_view_series = df_large.loc[:10, "positive_ints"]
data_view_df = df_large.loc[:10, ["mixed_ints", "positive_ints"]]
# for getting a subset of rows, also use `loc`
row_subset_df = df_large.loc[:10, :]

This is just a combination of what I personally use and what I’ve seen others do. Here’s another set of ways to do the same things as above, but you may see why they either haven’t seen as much adoption or are actively discouraged.

# one way is to use `df.loc` for everything, but it can look clunky
one_column_series = df_large.loc[:, "mixed_ints"]
two_column_df = df_large.loc[:, ["mixed_ints", "positive_ints"]]
# you can use `iloc`, which is like `loc` but with positions instead of labels;
# it's less clear, you're in trouble if the column order ever changes, and
# `iloc[:10]` stops before position 10 while `loc[:10]` includes label 10
data_view_series = df_large.iloc[:10, 2]
data_view_df = df_large.iloc[:10, [3, 2]]
# you can get rows like you slice a list, but this can be confusing
row_subset_df = df_large[:10]

Here’s a little gotcha about that last example: df_large[:10] gets you the first 10 rows, but df_large[10] looks up a column labeled 10, which raises a KeyError here because our columns are all strings. This is why it’s so important to index things as clearly as possible, even if it means more typing.
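To see the difference in action (a quick illustration, not part of the original walkthrough):

# slicing gets you rows, but a bare integer is a column lookup
first_ten_rows = df_large[:10]  # rows 0 through 9
try:
    df_large[10]  # looks for a column labeled 10
except KeyError:
    print("KeyError: there is no column labeled 10")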

But don’t use chained indexing

What is chained indexing? It’s when you index the rows and the columns separately, which makes two separate calls to __getitem__() (or worse, one call to __getitem__() and one to __setitem__() if you’re making assignments). It’s not so bad if you’re only reading and not assigning, but it’s still not ideal from a readability standpoint: if you index the rows in one place and then index a column somewhere else, it’s easy to lose track of exactly what you indexed unless you’re very disciplined about variable naming.

# this is what chained indexing looks like
data_view_series = df_large[:10]["mixed_ints"]
data_view_df = df_large[:10][["mixed_ints", "positive_ints"]]
# this is also chained indexing, but low-key
row_subset_df = df_large[:10]
data_view_df = row_subset_df[["mixed_ints", "positive_ints"]]

The second example isn’t something people usually intend to do; it often happens when people modify DataFrames in nested for loops, where you first iterate through rows and then want to do something to some of the columns. If you get a SettingWithCopyWarning, look for these kinds of patterns, or anywhere you might be separately slicing rows and then slicing columns (or vice versa), and replace them with a single df.loc[rows, columns].
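Here’s a hedged sketch of the anti-pattern next to its replacement; the exact warning behavior varies across Pandas versions, especially with copy-on-write enabled.

# chained assignment: slice the rows first, then assign into the slice
warned = df_large.copy()
warned[:10]["positive_ints"] = 0  # may warn, and the assignment may not stick
# the unambiguous fix: a single `loc` call that names rows and columns together
warned.loc[:9, "positive_ints"] = 0  # always modifies `warned` itself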

Common silent failures

Even if you do all of the above, sometimes Pandas’ flexibility can lull you into making mistakes that don’t actually error out. These are particularly pernicious because you often don’t realize something is wrong until something far downstream stops making sense, and by then it’s very hard to trace the problem back to its cause.

View vs. Copy

A view and a copy of a DataFrame can look identical in terms of the values they contain, but a view references a piece of an existing DataFrame while a copy is an entirely separate DataFrame. If you change a view, you change the existing DataFrame; if you change a copy, the original DataFrame is unaffected. Make sure you aren’t modifying a view when you think you’re modifying a copy, and vice versa.

It turns out that whether you’re dealing with a copy or a view is very difficult to predict! Internally, Pandas tries to optimize by returning a view or a copy depending on the DataFrame and the actions you take. You can force Pandas to make a copy for you by using df.copy() and you can force Pandas to operate in place on a DataFrame by setting inplace=True when it’s available.

When to make a copy and when to use a view? It’s hard to say for sure, but if your data is small or your resources are large and you want to go functional and stateless, you can try making a copy for every operation, like Spark would, since it’s probably the safest way to do things. On the other hand, if you have lots of data and a regular laptop, you might want to operate in place to prevent your notebooks from crashing.
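If you want a rough sense of what a full copy costs (a quick check with DataFrame.memory_usage, not part of the original walkthrough):

# each full copy duplicates all of the underlying data
original_bytes = df_large.memory_usage(deep=True).sum()
copy_bytes = df_large.copy().memory_usage(deep=True).sum()
print(original_bytes, copy_bytes)  # roughly the same size, held twice over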

# intentionally making a copy will always make a copy
small_copy = df_small.copy()
small_copy

While they look identical, this is actually a totally separate data object from the df_small we made earlier.

Let’s look at what inplace=True does. df.drop() is a method that drops rows or columns from a DataFrame, and it’s one of the Pandas DataFrame methods that let you specify whether the operation should happen in place or return a new object. Returning a new object (inplace=False) is always the default.

small_copy.drop("b", axis=1)
small_copy

What happened to the drop? It returned a new object, but you didn’t point to it with a variable, so it disappeared. The original, small_copy, remains unmodified. Now let’s turn on in-place modification:

small_copy.drop("b", axis=1, inplace=True)
small_copy

The original, small_copy, has now been changed. Good thing we made a copy!

Let’s remake small_copy and make a point modification to see what I meant earlier about the original DataFrame being a different object.

# let's see what happens if you assign to a copy
small_copy = df_small.copy()
# you should always use `loc` for assignment
small_copy.loc[0, 'b'] = 4
small_copy
# original is unchanged
df_small

Watch out for out-of-order processing

In Jupyter notebooks, it’s almost unavoidable to change and re-run cells out of order; we know we shouldn’t, but it always ends up happening. This is really a subset of the view vs. copy problem: if you know you’re making a change that fundamentally alters the properties of a column, eat the memory cost and make a new copy. Otherwise, something like the following can happen, where re-running the latter two cells in this block over and over makes the max value drift.

large_copy = df_large.copy()
large_copy.loc[0, "positive_ints"] = 120
large_copy["positive_ints"].max()
> 120

Remember, positive_ints was generated to be between 0 and 100, so after setting the first value to 120, the max value has to be 120. But what happens if I run a cell like this a few times?

large_copy["positive_ints"] = large_copy["positive_ints"] * 3
large_copy["positive_ints"].max()
> 360

The first time you run it, it behaves as expected, but run it again…

large_copy["positive_ints"] = large_copy["positive_ints"] * 3
large_copy["positive_ints"].max()
> 1080

This example seems obvious, but it’s really easy to write cells that mutate data irreversibly, in a way that compounds every time you re-run them. The best way to avoid this problem is to create a new DataFrame with a new variable name at the start of any cell that makes significant in-place data mutations.

large_copy_x3 = large_copy.copy()
large_copy_x3["positive_ints"] = large_copy["positive_ints"] * 3
large_copy_x3["positive_ints"].max()
> 360

No matter how many times you run that block, it will always return 360, just as you originally expected.

Never set errors to "ignore"

Some Pandas methods have an errors keyword that lets you silence errors. This is almost always a bad idea, because ignoring errors just puts your unparsed input where the output should have been. Let’s make a Series with two normal dates and a Futurama-style one.

parsed_dates = pd.to_datetime(["10/11/2018", "01/30/1996", "04/15/9086"], format="%m/%d/%Y", errors="ignore")
parsed_dates
> array(['10/11/2018', '01/30/1996', '04/15/9086'], dtype=object)

Note in the example output that if you weren’t overly familiar with what Pandas outputs should look like, an array here might not strike you as unusual, and you might just move on with your analysis without knowing that something had gone wrong.

If you remove errors="ignore", you will see the traceback:

Traceback (most recent call last):
  File "<ipython-input-22-12145c38fd6e>", line 3, in <module>
    pd.to_datetime(["10/11/2018", "01/30/1996", "04/15/9086"], format="%m/%d/%Y")
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9086-04-15 00:00:00

What happened here is that Pandas timestamps are stored as nanoseconds in an int64, so any year later than roughly 2262 overflows what can be represented. As a data scientist, you may be forgiven for not knowing that bit of esoterica, but it’s just one of the many hidden idiosyncrasies buried in Python/Pandas, so ignore errors at your peril.
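You can check the representable range directly with the public Timestamp bounds:

# the nanosecond int64 representation limits the dates Pandas can hold
# (roughly the years 1677 through 2262)
print(pd.Timestamp.min)
print(pd.Timestamp.max)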

This is what a date Series is supposed to look like, in case you were wondering:

pd.to_datetime(["10/11/2018", "01/30/1996"], format="%m/%d/%Y")
> DatetimeIndex(['2018-10-11', '1996-01-30'], dtype='datetime64[ns]', freq=None)
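And if errors="ignore" is already baked into code you can’t easily change, a sanity check in the spirit of the earlier section will catch the silent fallback (a sketch, not from the original article):

# assert that the parse actually produced datetimes, not your raw input
parsed = pd.to_datetime(["10/11/2018", "01/30/1996"], format="%m/%d/%Y", errors="ignore")
assert pd.api.types.is_datetime64_any_dtype(parsed), "dates did not parse cleanly"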

The object dtype can hide mixed types

Each Pandas column has a type, but there is an uber-type called object, which means each value is really just a pointer to some arbitrary Python object. This gives Pandas a great deal of flexibility (i.e. columns of lists or dictionaries or whatever you want!), but it can result in silent failures.

Spoiler alert: this won’t be the first time object type causes us problems. I don’t want to say you shouldn’t use it, but once you’re in production mode, you should definitely use it with caution.
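To see just how permissive object columns are (a quick illustration):

# object dtype just stores pointers, so heterogeneous contents are allowed
pd.Series([[1, 2], [3, 4, 5], "not a list"]).dtype
> dtype('O')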

# we start out with integer types for our small data
small_copy = df_small.copy()
small_copy.dtypes
> 
a    int64
b    int64
dtype: object
# reset one of the columns' dtypes to `object`
small_copy["b"] = small_copy["b"].astype("object")
small_copy.dtypes
>
a    int64
b    object
dtype: object

Now let’s make some mischief again and mutate the b column with probably one of the most frustrating silent failures out there.

small_copy["b"] = [4, "4"]
small_copy

(I’m sure for a lot of you, this is the data equivalent of nails on a chalkboard.)

Now let’s say you need to drop duplicates and ship the result. If you dropped duplicates by column a, you would get what you expect (a single row, since both values in a are 1):

small_copy.drop_duplicates("a")

But if you dropped by column b, both rows would survive: the integer 4 and the string "4" print identically, but they aren’t equal, so Pandas doesn’t consider them duplicates.

small_copy.drop_duplicates("b")

Tread carefully with Pandas schema inference

When you load a big, mixed-type CSV and Pandas offers you the option to set low_memory=False because it ran into data it doesn’t know how to handle, what it actually ends up doing is making that entire column object type: the values it can convert to int64 or float64 get converted, and the ones it can’t just sit there as str. That lets the column values peacefully co-exist, for now. But once you try to do any operations on them, you’ll see that Pandas was trying to tell you all along that you can’t assume all the values are numeric.

Note: Remember, in Python, NaN is a float! So if your numeric column has NaNs, cast it to float even if the values are conceptually int.
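A one-liner makes the point:

# a single NaN upcasts an otherwise-integer column to float64
pd.Series([1, 2, np.nan]).dtype
> dtype('float64')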

Let’s create an intentionally obtuse DataFrame and save it somewhere so we can read it back in. This DataFrame has a combination of integers and a string that the (most likely human, probably working in Excel) writer thought was an indication of a NaN value but is actually not covered by the default NaN parsers.

mixed_df = pd.DataFrame({"mixed": [100] * 100 + ["-"] + [100] * 100, "ints": [100] * 201})
mixed_df.to_csv("test_load.csv", header=True, index=False)

When you read it back in, Pandas will attempt schema inference, but it will stumble on the mixed-type column, decline to be too opinionated about it, and leave the values as they are in an object column.

mixed_df = pd.read_csv("test_load.csv", header=0)
mixed_df.dtypes
>
mixed    object
ints     int64
dtype: object

The best way to make sure your data is interpreted correctly is to set your dtypes manually and specify your NaN characters. It’s a pain, and you don’t always have to do it, but it’s always a good idea to open up the first few lines of any data file you have and make sure it can really be automatically parsed.

mixed_df = pd.read_csv("test_load.csv", header=0, dtype={"mixed": float, "ints": int}, na_values=["-"])
mixed_df.dtypes
>
mixed    float64
ints     int64
dtype: object

Conclusion, Part I

Pandas is probably the greatest thing since sliced lists, but it’s auto-magic until it’s not. To understand when and how you can let it take the wheel and when you have to do a manual override, you either need to have a solid grasp of its internal workings, or you can be like me and hit your head against them for half a decade and try to learn from your mistakes. As the saying goes, fool me once, shame on you. Fool me 725 times…can’t fool me again.

Please check out Part II: Speed-ups and memory optimizations.

