So you’ve done it, you’ve got a nice time series with helpful features in a Pandas
DataFrame. Maybe you’ve used pd.ffill()
or pd.bfill()
to fill in empty time steps using the previous or next value and perform analysis or feature extraction on your full series.

But once the analysis is done, what do you do with all those redundant rows?
We faced this problem today at Mansa, where we were saving hundreds (if not thousands) of unnecessary rows after our pre-processing pipeline was completed.
Since we deal with financial data, we want to be able to tell the balance of an account at any point in time and to compute statistics over windows of days and across time. As a result, our pre-processing includes a step that resamples our working DataFrame to fill in all the dates with missing information. The issue was that we were saving this filled-in DataFrame, with its resampled dates and filled-in values, instead of compressing it back.
TL;DR: Skip to the last snippet for the most efficient way!
To stand in for our real data, I’ve made a simple example generator that spits out a filled-in time series. The date column is the column of daily timestamps that will be generated and filled to be complete (through resampling).
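Here is a minimal sketch of such a generator, assuming a single balance column and daily resampling with a forward fill; the function name, column names, and default parameters (500 observation days over roughly three years) are illustrative, not our exact production code.

```python
import numpy as np
import pandas as pd

def generate_filled_series(n_obs: int = 500,
                           start: str = "2018-01-01",
                           end: str = "2020-12-31",
                           seed: int = 42) -> pd.DataFrame:
    """Generate a sparse daily balance series and resample it to a full daily index."""
    rng = np.random.default_rng(seed)
    all_days = pd.date_range(start, end, freq="D")

    # Pick n_obs random days on which the balance actually changes.
    obs_days = np.sort(rng.choice(all_days.to_numpy(), size=n_obs, replace=False))
    balances = rng.normal(loc=0.0, scale=100.0, size=n_obs).cumsum().round(2)

    sparse = pd.DataFrame({"date": obs_days, "balance": balances})

    # Resample to a complete daily index and forward-fill the balance,
    # so that every calendar day has an end-of-day value.
    return (sparse.set_index("date")
                  .resample("D")
                  .ffill()
                  .reset_index())
```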
As a simple example, assume we have a user with an account that dates back to early 2018. This user uses it as a secondary account, and so only has a transaction every two days or so. Over three years, they only have 500 days where the balance actually changes.
Say that, in order to plot or analyse their balance over time, we want to know it at the end-of-day mark, and so have to expand the series of 500 observations back to the full three years (resampling). Since the balance remains in the account until it is observed to change, we can fill the days after an observation with the balance of that observation, until the next observation (ffill).
So if I had a 5€ balance on Thursday night and 500€ on Sunday night, then I know that I also had 5€ on Friday night and Saturday night, since observations only happen when there is a change in the balance.
With about 500 random observations over three years, resampling the dates to get the full daily DataFrame takes us from 500 rows to roughly 1,100 (365 × 3).
Why not simply remove duplicates?
Because if a day has the same balance as another day occurring much later, a naive drop_duplicates will drop the later one even though the balance changed in between: it ignores the temporal dimension of your data.
Depending on how small your time steps are and how sparse your observations are, you can end up with a lot of repeated or empty rows in your DataFrame.
So, how to compress it back?
- First naive solution:
My first approach was to simply iterate over the sorted DataFrame and detect whenever a row differs from the previous one (ignoring the time step column). In other words, I iterate over the rows of the DataFrame and collect all the indexes where a change happens between time step i and i-1.
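A sketch of this first attempt might look like the following, assuming the generated DataFrame above with a "date" column plus value columns (the function name and column handling are illustrative):

```python
import pandas as pd

def compress_iterrows(df: pd.DataFrame, time_col: str = "date") -> pd.DataFrame:
    """Keep only the rows whose non-time columns differ from the previous row."""
    value_cols = [c for c in df.columns if c != time_col]

    keep_index = []
    previous = None
    for idx, row in df.iterrows():
        current = row[value_cols]
        # Keep the first row, and any row that differs from the one before it.
        if previous is None or not current.equals(previous):
            keep_index.append(idx)
        previous = current

    return df.loc[keep_index]
```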
This works, but iterrows
is not fast.
Timing the block of code with %%timeit on my small generated DataFrame, I get:
2.39 s ± 794 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can do better!
- Taking a stab at using reduce:
OK, so iterrows is slow; what can we do to make it faster? My second thought was to work directly with the numpy values array and use reduce from the standard library (functools) in order to efficiently go over two rows at a time.
I first find the indexes of the columns to compare (again ignoring the time step column "date") and then use a reduce function that appends a row only if it differs from the previously kept one. We can then build a DataFrame back from the retained values.
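Under the same assumptions as before, a sketch of this version could be:

```python
from functools import reduce

import numpy as np
import pandas as pd

def compress_reduce(df: pd.DataFrame, time_col: str = "date") -> pd.DataFrame:
    """Keep only the rows whose value columns differ from the last kept row."""
    value_idx = [i for i, col in enumerate(df.columns) if col != time_col]
    rows = df.to_numpy()

    def keep_if_changed(kept, row):
        # Append the row only if its value columns differ from the last kept row.
        if not np.array_equal(kept[-1][value_idx], row[value_idx]):
            kept.append(row)
        return kept

    compressed = reduce(keep_if_changed, rows[1:], [rows[0]])
    return pd.DataFrame(compressed, columns=df.columns)
```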
This was already much faster (100x improvement)!
20.4 ms ± 6.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
But can we make it better?
- Back to pandas:
The next big realization I had was that since I only cared about the previous time step, I could simply shift the entire table to compare all values simultaneously!
This simplifies the code greatly. I still only want to compare the columns excluding "date", so I shift that selection of columns by one and compare it to the original. With ~(...).all(axis=1) I can compare the two row-wise and select the rows in the original df that are not the same as the previous row.
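A minimal sketch of that idea (again with an illustrative wrapper around what is essentially a two-liner):

```python
import pandas as pd

def compress_shift(df: pd.DataFrame, time_col: str = "date") -> pd.DataFrame:
    """Keep rows where at least one value column differs from the previous row."""
    value_cols = [c for c in df.columns if c != time_col]

    # Compare each row with the previous one; the first row is always kept
    # because comparing against the NaNs introduced by shift() yields False.
    unchanged = (df[value_cols] == df[value_cols].shift()).all(axis=1)
    return df.loc[~unchanged]
```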
This is even faster (no surprise, since we are no longer iterating at all)! It is 4x faster than the reduce approach and 400x faster than the naive iterrows!
4.61 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It also has the benefit of fitting in two lines of code!
- To go even faster, we can implement this in numpy:
This does essentially the same as the pandas shift. We construct the shift using the np.roll function and set the first row to nan. Again, we compare all values against the shifted array row-wise.
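A sketch of that numpy version, assuming the value columns are numeric so they can be cast to float:

```python
import numpy as np
import pandas as pd

def compress_numpy(df: pd.DataFrame, time_col: str = "date") -> pd.DataFrame:
    """Keep rows where at least one value column differs from the previous row."""
    value_cols = [c for c in df.columns if c != time_col]
    values = df[value_cols].to_numpy(dtype=float)

    # Shift every row down by one; np.roll wraps the last row around to the
    # top, so we overwrite that first row with NaN to make sure it is kept
    # (NaN compares unequal to everything).
    shifted = np.roll(values, shift=1, axis=0)
    shifted[0] = np.nan

    changed = (values != shifted).any(axis=1)
    return df.loc[changed]
```

On the generated example, this should bring the ~1,100 daily rows back down to roughly the original 500 observation days.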
This is again faster!
973 µs ± 29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Nearly 5 times faster than the previous approach and 2000x faster than the naive approach!
Note: all of these solutions assume the DataFrame is sorted by the time step column (in this case "date"). They would not work otherwise.
Conclusion
Avoid iterrows
, use shift
.
In the end I opted for solution #3 rather than the most performant one, since it is fewer lines of code and probably easier to understand. Sometimes the simplicity of a solution is more important than its performance gains.
I am sure we could push the performance even more and I challenge you, the reader, to try it!
I hope that you learned something or that this has helped you tackle a similar problem. This is my first blog entry, so apologies if it is a bit unstructured. I’d really appreciate any feedback, and if you have a different way of doing this, don’t hesitate to let me know!