Categorical datatypes are often touted as an easy win for cutting down DataFrame memory usage in pandas, and they can indeed be a useful tool. However, if you imagined you could just throw in a `.astype("category")` at the start of your code and have everything else behave the same (but more efficiently), you’re likely to be disappointed.
This article focuses on some of the real-world problems you’re likely to face when using categorical datatypes in pandas, whether you’re adjusting your mindset to write new code with categories or migrating existing pipelines to use categorical columns.
The behaviour described in this article is current as of `pandas==1.2.3` (released March 2021), but don’t worry if you’re reading this at a much later date: the behaviour described is unlikely to change significantly in future versions – but leave a comment if it has!
Why use categorical datatypes in pandas?
If categorical data can be a pain, then why not just avoid it altogether? Well, using categories can bring some significant benefits:
- Memory usage – for string columns with many repeated values, categories can drastically reduce the amount of memory required to hold the data
- Runtime performance – there are optimizations in place which can improve execution speed for certain operations
- Library integrations – in some cases libraries have special functionality for categorical columns; for example, `lightgbm` handles categorical columns differently when building models (a sketch follows this list)
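To illustrate that last point, here’s a minimal sketch of how lightgbm’s scikit-learn API picks up pandas categorical columns; the dataframe and target here are made up purely for illustration:

import lightgbm as lgb
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "species": pd.Categorical(["cat", "dog", "ape", "gorilla"] * 25),
        "float_1": np.random.rand(100),
    }
)
y = np.random.rand(100)
# categorical_feature='auto' (the default) means pandas categorical
# columns are used as categorical features directly – no one-hot
# encoding required.
model = lgb.LGBMRegressor().fit(X, y)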
The Happy Path
Let’s do an obligatory ‘happy path’ example. Here’s what things look like in a simple world where everyone smiles at each other:
import numpy as np
import pandas as pd

df_size = 100_000
df1 = pd.DataFrame(
    {
        "float_1": np.random.rand(df_size),
        "species": np.random.choice(["cat", "dog", "ape", "gorilla"], size=df_size),
    }
)
df1_cat = df1.astype({"species": "category"})
Here we have created two dataframes: `df1`, which contains `species` as an `object` column, and `df1_cat`, which is a copy of the same dataframe but with `species` as a categorical datatype.
>> df1.memory_usage(deep=True)
Index 128
float_1 800000
species 6100448
dtype: int64
We can see here how expensive it is to keep a column of strings in terms of memory usage: the string column occupies about 6MB (and if the strings were longer, it would take even more), compared to the 0.8MB taken up by the float column. We could afford nearly 8 `float64` columns for the price of that one string column… pricey.
>> df1_cat.memory_usage(deep=True)
Index 128
float_1 800000
species 100416
dtype: int64
Looking at the memory usage after casting to a category, we see a pretty drastic improvement: about 60x less memory used, very nice. We can now afford 8 of these string columns for the price of one `float64` column – oh how the tables have turned.
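The saving comes from the representation: a categorical column stores one small integer code per row plus a single copy of each unique category value, instead of one Python string per row. A quick way to see this (outputs as of pandas 1.2):

>> df1_cat["species"].cat.codes.dtype
dtype('int8')
>> df1_cat["species"].cat.categories
Index(['ape', 'cat', 'dog', 'gorilla'], dtype='object')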
This is cool; however, it’s only really cool if we can keep it that way…
Categorical columns are fragile things
Seems like a weird thing to say? Well… working with categories can be a lot like playing with those wobbly dolls that spring right back up when you push them over: they will all too easily rock right back into `object`s if you don’t pay very close attention to each operation using the categorical columns.
In the best case scenario this is annoying; in the worst case it kills your performance (because casting types is expensive) and/or blows out your memory (because the data only fit into memory while it was stored as a category).
Operating on categorical columns

It’s probable that at some point you’re going to want to do something with your categorical columns, and one of those things might be a transformation. This is the first place where we’re going to have to show some diligence…
Since categorical columns are often text-based, let’s look at an example using string manipulations. We can do these on categorical columns in the same way we ordinarily do for text-based `object` columns: by using the `.str` accessor.
On non-categorical string series:
>> %timeit df1["species"].str.upper()
25.6 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
On categorical string series:
>> %timeit df1_cat["species"].str.upper()
1.85 ms ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The categorical version is a clear winner on performance, about 14x faster in this case (the internal optimizations mean `.str.upper()` is only called once on the unique category values, and a series is then constructed from the outcome, instead of calling it once per value in the series). However, this is where we run into our first major gotcha…
>> df1_cat["species"].str.upper().memory_usage(deep=True)
6100576
We’ve lost our categorical type: the result is an `object` column and the data compression is gone; the result is now back at its 6MB size. We could recast back to a category after this, but that is a lot of to-ing and fro-ing between types – it makes our code messier and doesn’t reduce our peak memory usage.
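For reference, the round trip looks like this; note that the full ~6MB intermediate `object` series still has to be materialised before the cast back, which is why peak memory doesn’t improve (a sketch using the example names above):

# The .str.upper() result is a full object series that must exist in
# memory before we can compress it back down into a category.
upper_species = df1_cat["species"].str.upper().astype("category")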
Efficient alternatives
Oftentimes an efficient alternative is to rewrite your code to operate directly on the categories themselves rather than on the series of their values. This takes a bit of a change in mindset (and implementation): you can think of it as doing an operation once for each unique value in the column, rather than once for each instance in the column. If a column has only 2 unique values in its categories spread over 100,000 rows, it makes sense to do the operation 2 times rather than 100,000 times.
For example, in the above case we could do the following:
>> %timeit df1_cat["species"].cat.rename_categories(str.upper)
239 µs ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That’s roughly another 8x faster than using the `.str` accessor on the categorical column (and over 100x faster than the plain `object` version), and the main benefit is that we never convert the categorical series into an expensive intermediate `object` state, so we keep our memory efficiency – very nice.
It’s a toy example, but the principles hold for more complex cases: if there isn’t a specific `.cat` accessor method that helps in your situation, consider operating on `df1_cat["species"].dtype.categories`, which contains the unique categorical values, rather than on the whole series.
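For instance, suppose we need a transformation with no dedicated `.cat` method – say, a (hypothetical) vowel-stripping helper. We can compute it once per unique category and map the results back with `rename_categories` (a sketch using the example names above):

# Compute the transformation once per unique category, then apply the
# resulting mapping to the series; the result stays categorical.
def strip_vowels(s):
    return "".join(ch for ch in s if ch not in "aeiou")

mapping = {c: strip_vowels(c) for c in df1_cat["species"].dtype.categories}
stripped = df1_cat["species"].cat.rename_categories(mapping)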
Merging with categorical columns

In all but the simplest of use cases, we’re likely to have not just one dataframe but multiple dataframes, which we’ll probably want to stick together at some point. As an example, we’re going to drum up a small ‘reference’ dataset which contains the habitats of the species in our first dataset:
df2 = pd.DataFrame(
    {
        "species": ["cat", "dog", "ape", "gorilla", "snake"],
        "habitat": ["house", "house", "jungle", "jungle", "jungle"],
    }
)
df2_cat = df2.astype({"species": "category", "habitat": "category"})
As before, we’ve created one categorical version of this dataset and one with `object` strings. One thing that’s important to note here is that we have an extra species (`snake`) which isn’t present in `df1`, the dataframe we’ll be merging with. This will be important later (but don’t worry, there won’t be a test).
I won’t show an example of merging together two `object` columns because you all know what happens: `object` + `object` = `object`. There is no magic, it’s just a merge.
Merging an object column on a categorical column
For this little test, we’re going to take one of our categorical dataframes and merge it with a dataframe whose merge column is a plain `object` column.
>> df1.merge(df2_cat, on="species").dtypes
float_1 float64
species object
habitat category
dtype: object
So here we had `species` as `object` on the left and `category` on the right. We can see that when we merge, we get `category` + `object` = `object` for the merge column in the resulting dataframe. So blah blah blah, this hits us in the memory again when we snap back to `object`. Not super surprising, but again, something to be mindful of when working with categories.
Merging two categorical columns
Hopefully I’ve set you up to think that I’m leading up to `category` + `category` = `category`. Well, let’s take a look:
>> df1_cat.merge(df2_cat, on="species").dtypes
float_1 float64
species object
habitat category
dtype: object
Call the police, I have been deceived… `category` + `category` = `object`.
So what’s the sting in the tail? In a merge, in order to preserve the categorical type, the two categorical types must be exactly the same. Unlike the other datatypes in pandas (where, for example, all `float64` columns have the same datatype), a categorical datatype is actually described by the set of values that can exist in that particular category, so a category containing `["cat", "dog", "mouse"]` is a different type from a category containing `["cheese", "milk", "eggs"]`. Although in our case the two look pretty much the same (one is a subset of the other), in `df2` we have a species which isn’t present in `df1` (the slippery `snake`).
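We can see this directly by comparing the two dtypes (using the example dataframes above):

>> df1_cat["species"].dtype
CategoricalDtype(categories=['ape', 'cat', 'dog', 'gorilla'], ordered=False)
>> df2_cat["species"].dtype
CategoricalDtype(categories=['ape', 'cat', 'dog', 'gorilla', 'snake'], ordered=False)
>> df1_cat["species"].dtype == df2_cat["species"].dtype
False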
We can think of the behaviour on merge columns like this:
`category1` + `category2` = `object`
`category1` + `category1` = `category1`
So adapting the previous example we can get the result we want and expect:
>> df1_cat.astype({"species": df2_cat["species"].dtype}).merge(
df2_cat, on="species"
).dtypes
float_1 float64
species category
habitat category
dtype: object
Above we can see that setting the categorical types to match and then merging gives us the desired result… finally.
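This worked because `df2_cat`’s categories happened to be a superset of `df1_cat`’s, so we could reuse its dtype directly. In the general case, where neither column’s categories contain the other’s, one option is to build a shared dtype from the union of both before merging (a sketch, assuming the dataframes from above):

# Build one dtype covering both columns' categories, then cast both
# sides to it so the merge column stays categorical.
shared_dtype = pd.CategoricalDtype(
    df1_cat["species"].cat.categories.union(df2_cat["species"].cat.categories)
)
merged = df1_cat.astype({"species": shared_dtype}).merge(
    df2_cat.astype({"species": shared_dtype}), on="species"
)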
Grouping with categorical columns

Grouping with categories has personally bitten me in the ankles a couple of times when adapting code to work with categorical datatypes: once by unexpectedly changing behaviour and producing a dataframe full of null values, and another time by causing an operation to hang indefinitely (even though it had previously taken only a couple of seconds with `object` datatypes).
When you group on a categorical datatype, by default you group on every value in the datatype even if it isn’t present in the data itself.
What does that mean? It’s probably best illustrated with an example.
habitat_df = (
    df1_cat.astype({"species": df2_cat["species"].dtype})
    .merge(df2_cat, on="species")
)
house_animals_df = habitat_df.loc[habitat_df["habitat"] == "house"]
So here `habitat_df` is the merge example from the previous section (where both `species` and `habitat` end up being categorical), and `house_animals_df` contains just the animals which live in a house – in our case `cat` and `dog`. Let’s try to find the average `float_1` value for each of these species.
>> house_animals_df.groupby("species")["float_1"].mean()
species
ape NaN
cat 0.501507
dog 0.501023
gorilla NaN
snake NaN
Name: float_1, dtype: float64
Something looks a bit off: we now get a bunch of null values in our groupby. By default, when grouping on categorical columns, pandas returns a result for each value in the category, even when it’s not present in the data. This can be annoying (since it’s an unexpected change in behaviour), and it can also hurt performance if the datatype contains a lot of groups which aren’t present in the relevant dataframe – particularly if grouping on multiple different categorical columns.
To get the result we want, we can pass `observed=True` into the groupby call; this ensures that we only get groups for values present in the data.
>> house_animals_df.groupby("species", observed=True)["float_1"].mean()
species
cat 0.501507
dog 0.501023
Name: float_1, dtype: float64
Resolved, yes, but it’s yet another gotcha to keep us on our toes.
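The performance hang mentioned earlier comes from the same default: grouping on multiple categorical columns produces a group for every combination of categories, present or not, and that cartesian product can get enormous. On our small example (using the dataframes above):

# 2 habitat categories x 5 species categories = 10 groups by default,
# even though only 4 combinations actually occur in the data.
>> habitat_df.groupby(["habitat", "species"])["float_1"].mean().shape
(10,)
>> habitat_df.groupby(["habitat", "species"], observed=True)["float_1"].mean().shape
(4,)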
Categorical column index

This case is a little more niche, but it adds a bit of colour to the kinds of unexpected problems you can run into if you don’t keep your wits about you when using categoricals. There are scenarios where you might move row values into columns, for example the groupby-unstack combo, which is somewhat of a pro-gamer move.
>> species_df = habitat_df.groupby(["habitat", "species"], observed=True)["float_1"].mean().unstack()
>> species_df
species cat ape dog gorilla
habitat
house 0.501507 NaN 0.501023 NaN
jungle NaN 0.501284 NaN 0.501108
So now we’ve got a dataframe with the species in the columns. All looks normal so far – let’s add another column and see what happens.
>> species_df["new_col"] = 1
TypeError: 'fill_value=new_col' is not present in this Categorical's categories
Normally this code would be completely fine: we’re just trying to add a new column called `new_col` which always has the value 1. Just glancing at the dataframe, nothing looks wrong, and the error seems kind of confusing – which `Categorical` is it talking about? We’re just trying to add a new column.
What happens is that the `.unstack()` (which, for the uninitiated, flips the index into the columns, much like a pivot) moves the categorical index into the column index. It’s not possible to add an entry to a categorical index which isn’t already in the categorical datatype, hence the error. This isn’t immediately obvious though, and you could be forgiven for scratching your head if you ever ended up here.
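If you do end up here, one way out is to cast the column index back to a plain index before adding new columns (a minimal sketch using the names above):

# The column index inherited the categorical dtype from the unstack;
# casting it back to object lets us add arbitrary new columns again.
species_df.columns = species_df.columns.astype("object")
species_df["new_col"] = 1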
There probably isn’t a strong case for ever having a categorical column index, but if you accidentally end up with one, and/or you start to see strange errors similar to this, it may be worth checking the datatypes of all the things you’re working with to make sure there’s nothing weird and categorical going on.
Conclusion
Pandas categorical dtypes are cool, and can bring some good performance benefits. When adopting categorical datatypes, it’s important to be aware of how the datatype behaves in common situations, and especially important to make sure categoricals stay categorical throughout the flow of the program rather than flipping back to `object`.
Here’s a high level summary of things to be mindful of:
- When operating on categorical columns, select options which operate on the categories in the datatype rather than the values in the series which contain the datatype. This should allow you to preserve the categorical nature and also improve performance.
- When merging on categorical columns, be aware that to preserve the categorical nature, the categorical types in the merge columns of each dataframe must match exactly.
- When grouping on categorical columns, by default you will get a result for each value in the datatype, even if it’s not present in the data; you can change this by passing `observed=True` to the `.groupby`.
- When things that you expect to work unexpectedly stop working, consider whether a strange interaction with categoricals may be at play.
Best of luck, and happy catting!