Why You Should Probably Never Use pandas inplace=True

Truly it is a curse on the library and a pox on thee if you use it

Sven Harris
Towards Data Science


This article will explain what the pandas inplace=True keyword means, how it behaves, and why you should probably never use it.

Pandas is a big library; there are many different ways to write a program using pandas and arrive at the same result. However, each approach can have wildly different characteristics in terms of performance, maintainability, and of course code style points. I’ve penned a couple of other articles covering other areas of the pandas library; check them out if that’s something you care about.

This piece will be covering the controversial inplace=True which (spoiler alert) gives you negative cool points if you use it in your code.

Panda onplaice=True was the best illustration I could think of. I can only apologize. (Credit for the panda goes to Lucy Sheppard; the rest of the monstrosity is my own.)

Introduction — What is inplace=True?

This won’t be news to you if you’ve got experience using the inplace keyword, but here’s a quick recap of how it works. inplace is a parameter accepted by a number of pandas methods which affects how the method runs. Some examples of where you might commonly see this keyword (but hopefully not in your own code) are the methods .fillna(), .replace(), and .rename(); the list goes on.

inplace=True

Using the inplace=True keyword in a pandas method changes the default behaviour so that the operation on the dataframe doesn’t return anything; instead it ‘modifies the underlying data’ (more on that later). It mutates the actual object you apply it to.

This means that any other objects referencing this dataframe (such as slices) will now see the modified version of the data — not the original.
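A minimal sketch of both behaviours at once: the inplace call returns None, and a second variable pointing at the same object sees the mutation (the variable names here are made up for illustration).

```python
import pandas as pd

df = pd.DataFrame({"state": ["frozen"]})
alias = df  # a second name for the same object, not a copy

# The inplace call mutates df and returns None.
result = df.replace({"frozen": "melted"}, inplace=True)

print(result)                 # None
print(alias.loc[0, "state"])  # "melted": the alias sees the mutation too
```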

Imagine having an ice-cream: it’s frozen, but then you want to melt it. If you use inplace=True you change the state of the object. You can’t get your unmelted ice-cream back.

Below is a quick example to demonstrate how that might look in pandas:

>>> import pandas as pd
>>> ice_cream = pd.DataFrame({
...     "state": ["frozen"], "flavour": ["vanilla"]
... })
>>> ice_cream
    state  flavour
0  frozen  vanilla

Now let’s do the melt (which in this case is a simple string .replace(), not to be confused with pandas .melt()).

>>> ice_cream.replace({"frozen": "melted"}, inplace=True)
>>> ice_cream
    state  flavour
0  melted  vanilla

The ice-cream has turned to gloop!

When we melt inplace we change the underlying structure of our ice-cream. Image by author.

inplace=False

Alternatively, when using inplace=False (which is the default behaviour) the dataframe operation returns a copy of the dataframe, leaving the original data intact. We are no longer constrained by the laws of physics, we can have our cake and eat it too. Let’s see…

Because someone melted our ice-cream we’re going to have to re-make it first…

>>> ice_cream = pd.DataFrame({
...     "state": ["frozen"], "flavour": ["vanilla"]
... })
>>> ice_cream
    state  flavour
0  frozen  vanilla

Now let’s melt it again, but this time using inplace=False (we don’t strictly have to pass it in, as it’s the default option, but have done so below for illustration purposes).

>>> melted_ice_cream = ice_cream.replace({"frozen": "melted"}, inplace=False)
>>> ice_cream
    state  flavour
0  frozen  vanilla

So when we look back at our original ice-cream we see that it isn’t melted, but we also have a melted one. Now we have both: twice as much ice-cream…

>>> melted_ice_cream
    state  flavour
0  melted  vanilla
When inplace=False a copy of the data is made first, so the original ice-cream/data remains intact. Image by author.

As mentioned, it’s not just .replace() where this keyword parameter is available; you’ll see it in a whole host of different methods, such as .fillna(), .sort_values(), .query(), .drop_duplicates(), and .reset_index(), to name but a few.

The motivation for using inplace=True

There are a number of reasons why people reach for inplace=True; I’ve tried to characterise many of the things I’ve heard over the years in the following points.

  1. I don’t need the intermediate results, only the final output. Why would I want to keep producing copies with redundant data? The program should be like an ice-sculptor, chipping away at the same block of ice until the sculpture is complete!
  2. My computer has finite memory — isn’t it more efficient and faster to modify my dataframes in place without expensive copies?
  3. I don’t want to have to come up with variable names for all of these intermediate steps…
  4. I didn’t really think about it, I’ve just picked it up as a habit.

Feel free to comment if you think there are any other reasons you might want to use inplace=True , and I’ll add them to the list (hopefully with an appropriate rebuttal).

A response to the motivation

I don’t need the intermediate results, only the final output

This may be true on the happy path (when everything goes right); however, during development, testing, and debugging you probably do want to inspect some intermediate values at different points in the pipeline, and that’s easier to do when the state of your data doesn’t change over time.

One of the really dangerous parts of this pattern, which can easily let nasty bugs creep in, is the mutation of objects. Code that mutates the state of objects is described as having ‘side-effects’, because the act of running the code changes the state of the system in some way; in this case inplace=True has the side-effect of modifying the original dataframe. Let’s look at a trivial example of how this might lead to issues.

Let’s imagine we have a table, with sales per city and we want to produce a leaderboard of the top selling cities, as well as calculating the total amount of sales across all cities.

def create_top_city_leaderboard(df):
    df.dropna(subset=["city"], inplace=True)
    df.sort_values(by=["sales"], ascending=False, inplace=True)
    return df

def calculate_total_sales(df):
    return df["sales"].sum()

It shouldn’t matter what order we run these two tasks in because they are in theory completely independent. Let’s have a go at running these functions and see what happens.

>>> df = pd.DataFrame(
...     {
...         "city": ["London", "Amsterdam", "New York", None],
...         "sales": [100, 300, 200, 400],
...     }
... )
>>> calculate_total_sales(df)
1000
>>> create_top_city_leaderboard(df)
        city  sales
1  Amsterdam    300
2   New York    200
0     London    100

All looks ok up to this point, but let’s see what happens if we calculate the total sales again:

>>> calculate_total_sales(df)
600

Damn, we just lost 400; that was an expensive mistake! If you were writing code that called this function and you weren’t familiar with the internals of create_top_city_leaderboard, you would rightly be pretty upset that it clobbered your dataframe and caused a bug later in your code; that’s a nasty side effect.

This might seem like a contrived example, but in more complex code this anti-pattern is very prone to springing up if you depend on mutable state. There are often multiple things you want to do with a dataframe, and it’s very hard to write safe code when you have to think not just about what the dataframe is, but what state it will be in. If you never mutate the state of your objects, you can guarantee they will be in exactly the same state at all points in time, which makes the behaviour of programs much easier to understand and reason about.

For completeness here is a non-mutating approach to the same function:

def create_top_city_leaderboard(df):
    return (
        df.dropna(subset=["city"])
        .sort_values(by=["sales"], ascending=False)
    )
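Re-running the earlier sales check against this non-mutating version confirms that the original dataframe survives the leaderboard call (same toy data as above):

```python
import pandas as pd

def create_top_city_leaderboard(df):
    return (
        df.dropna(subset=["city"])
        .sort_values(by=["sales"], ascending=False)
    )

def calculate_total_sales(df):
    return df["sales"].sum()

df = pd.DataFrame({
    "city": ["London", "Amsterdam", "New York", None],
    "sales": [100, 300, 200, 400],
})

leaderboard = create_top_city_leaderboard(df)
print(calculate_total_sales(df))  # 1000: the original rows are all still there
```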

Isn’t it more efficient and faster to modify my dataframes in place without expensive data copying?

I was a little surprised when I found out the answer to this one: under the hood, in most cases, a copy is still created, the operation is carried out on it, and then as a final step the previous reference is overwritten with the transformed copy. This means that in most cases using inplace=True is no more efficient.

I think this is probably the most common and harmful misconception around inplace=True: it’s used in the name of performance, but offers none and brings additional downsides.

Note: There are some cases where it does offer a performance benefit by avoiding copies. However, to know whether a specific method benefits from inplace=True you will probably need to check the pandas source code, and if you’re relying on reading library source, the intention behind your decision won’t be clear and obvious in your application code.

There are dozens of ways of optimizing pandas code before even reaching for inplace in the name of performance.
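If you want to see this for yourself, below is a rough, machine-dependent benchmark sketch; the frame size and choice of .fillna() are arbitrary, and the timings are illustrative rather than a definitive measurement (they typically come out roughly level, which is the point):

```python
import timeit

import numpy as np
import pandas as pd

# A frame with some missing values to fill.
df = pd.DataFrame({"a": np.random.rand(200_000)})
df.loc[::10, "a"] = np.nan

# Copy explicitly in both cases so each run starts from identical data.
t_inplace = timeit.timeit(lambda: df.copy().fillna(0, inplace=True), number=20)
t_default = timeit.timeit(lambda: df.copy().fillna(0), number=20)

print(f"inplace=True:  {t_inplace:.4f}s")
print(f"inplace=False: {t_default:.4f}s")
```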

I don’t want to have to come up with variable names for all of these intermediate steps…

If you use chaining (which gives you major pandas style points), then you won’t have to!

inplace=True prevents the use of chaining because nothing is returned from the methods. That’s a big stylistic blow because chaining is where pandas really comes to life.
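To make that concrete, here’s a minimal sketch of what happens if you try to chain off an inplace call (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 1]})

# The inplace call returns None, so the chained call fails...
try:
    df.sort_values("a", inplace=True).reset_index(drop=True)
except AttributeError as exc:
    msg = str(exc)
    print(msg)  # 'NoneType' object has no attribute 'reset_index'

# ...and, worse, the mutation has already happened by the time it fails.
print(list(df["a"]))  # [1, 2]
```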

Let’s compare a miserable inplace example vs some beautiful chaining:

def create_country_leaderboard(df):
    country_df = df.groupby("country")[["sales", "refunds"]].sum()
    country_df.rename(index=str.lower, inplace=True)
    country_df.reset_index(inplace=True)
    country_df.sort_values(by="sales", inplace=True)
    return country_df

Pretty gross, I know. Now let’s see how we can spruce that up, also without having to come up with four new variable names.

def create_country_leaderboard(df):
    return (
        df.groupby("country")[["sales", "refunds"]]
        .sum()
        .rename(index=str.lower)
        .reset_index()
        .sort_values(by="sales")
    )

Oh yeah lovely, reads like a dream now.
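For the avoidance of doubt, both versions produce the same leaderboard; here’s the chained one run on some made-up data, with the bonus that the input frame is left untouched:

```python
import pandas as pd

def create_country_leaderboard(df):
    return (
        df.groupby("country")[["sales", "refunds"]]
        .sum()
        .rename(index=str.lower)
        .reset_index()
        .sort_values(by="sales")
    )

df = pd.DataFrame({
    "country": ["UK", "NL", "UK", "NL"],
    "sales": [100, 300, 200, 400],
    "refunds": [10, 20, 30, 40],
})

leaderboard = create_country_leaderboard(df)
print(leaderboard)
print(list(df["country"]))  # input untouched: ['UK', 'NL', 'UK', 'NL']
```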

I didn’t really think about it, I’ve just picked it up as a habit

Fair enough, now you have thought about it you can finally put the habit to rest.

Additional clout to stop using inplace=True

  • If you forget to add inplace=True to one of your lines (I did this when writing one of the examples), that operation silently does nothing to your dataframe, and it can be hard to spot because you’ll have a random line just not doing anything useful.
  • If you use Jupyter notebooks, it will make it even harder to manage the state of different objects — if a function that mutates your dataframe ends up erroring half way through the function (also happened to me when writing an example), you will end up with a half mutated dataframe.
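The first failure mode looks like this (a deliberately tiny sketch):

```python
import pandas as pd

df = pd.DataFrame({"sales": [3, 1, 2]})

# Meant to sort df, but forgot inplace=True (or to assign the result).
# The sorted copy is built and then silently thrown away.
df.sort_values("sales")

print(list(df["sales"]))  # still [3, 1, 2]
```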

Additionally, the core pandas dev team recommend against using this parameter and have spoken about plans to deprecate the behaviour (I’ve got the champagne on ice).

There’s a lengthy and interesting discussion on GitHub about this topic (here), so if you don’t take my word for it, please take theirs.

Conclusion

Often in software engineering there are trade-offs to be made when making design decisions about the code you write. In the case of inplace=True there are approximately zero benefits to using this pattern, but a significant number of costs:

  • It encourages a dangerous stateful, side-effecty style of coding which is likely to cause bugs.
  • It doesn’t behave like it’s widely expected to behave (and only very rarely offers any performance improvement at all).
  • It removes the ability to use chaining, which is (in my opinion) the sweet spot for writing beautiful pandas code.
  • The parameter will likely eventually be deprecated (so you may as well get used to not using it already).

In conclusion, inplace=True is best avoided!


Data Scientist/Python Engineer from the UK. Living in Amsterdam, working in payments.