
5 New Features in pandas 1.0 You Should Know About

Dynamic window functions, faster apply and more.

Panda meme from imgflip.com

pandas 1.0 was released on January 29, 2020. Although the version jumped from 0.25 to 1.0, there aren’t any drastic changes, contrary to what some pandas users might expect. The version increase merely reflects the maturity of the data processing library.

While there aren’t many groundbreaking changes, there are a few that you should know about.

1. Dynamic window size with rolling functions

Rolling window functions are very useful when working with time-series data (e.g. calculating a moving average). Previous versions of pandas required a fixed window size, e.g. a moving average over 3 periods. With pandas 1.0 we can bypass this requirement, as the rest of this section shows.
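For reference, here is a minimal sketch of the fixed-size case before we move on to the dynamic one. The values in prices are made up for illustration only.

```python
import pandas as pd

# Hypothetical values, used only to illustrate a fixed window of 3 periods.
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each result is the mean of the current value and the two values before it.
# The first two rows are NaN because fewer than 3 values are available.
moving_avg = prices.rolling(window=3).mean()
print(moving_avg)
```

Every window here has exactly 3 rows, no matter what the data looks like.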

Let’s calculate a moving average over a window that keeps expanding until it reaches a value greater than or equal to 10. First, we create a DataFrame containing 3 such values.

import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 10, 2, 3, 11, 2, 3, 12, 1, 2]})
df

The window should expand until a value greater than or equal to 10 is reached.

use_expanding =  (df.col1 >= 10).tolist()
use_expanding 
# output
[False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False]

For dynamically sized windows, we need to implement a custom indexer that inherits from the pandas BaseIndexer class. BaseIndexer has a get_window_bounds method, which calculates the start and end of each window.

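import numpy as np
from pandas.api.indexers import BaseIndexer

class CustomIndexer(BaseIndexer):
    # step=None keeps the signature compatible with later pandas releases,
    # which pass an extra step argument; it is unused here.
    def get_window_bounds(self, num_values=0, min_periods=None, center=None, closed=None, step=None):
        start = np.empty(num_values, dtype=np.int64)
        end = np.empty(num_values, dtype=np.int64)
        start_i = 0  # first row of the current expanding window
        for i in range(num_values):
            start[i] = start_i
            if self.use_expanding[i]:
                # threshold reached: close this window, the next one starts after row i
                start_i = end[i] = i + 1
            else:
                end[i] = i + self.window_size
        print('start', start)
        print('end', end)
        return start, end

indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)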

We pass the indexer to the rolling function and calculate the mean of each window. The print statements in get_window_bounds also let us observe the start and end indices of each window.

df.rolling(indexer).mean()
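To make the window bounds easier to follow, the sketch below recomputes the same means by hand with plain NumPy slicing, without going through rolling. The helper window_bounds is a hypothetical name introduced here; it applies the same rule as CustomIndexer, assuming window_size=1 as in the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 10, 2, 3, 11, 2, 3, 12, 1, 2]})
use_expanding = (df.col1 >= 10).tolist()

def window_bounds(use_expanding):
    """Hypothetical helper: same start/end rule as CustomIndexer with window_size=1."""
    n = len(use_expanding)
    start = np.empty(n, dtype=np.int64)
    end = np.empty(n, dtype=np.int64)
    start_i = 0
    for i, reset_after in enumerate(use_expanding):
        start[i] = start_i
        end[i] = i + 1          # every window ends at the current row (exclusive bound)
        if reset_after:
            start_i = i + 1     # the next window starts after a value >= 10
    return start, end

start, end = window_bounds(use_expanding)
values = df.col1.to_numpy()

# Mean of each half-open slice values[start:end]
manual_means = [values[s:e].mean() for s, e in zip(start, end)]
print(manual_means)
```

The first window grows over rows 0 to 3 and closes on the value 10 with a mean of 4.0 (the average of 1, 2, 3 and 10), after which a new window starts at row 4.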

2. Faster Rolling apply

Pandas uses Cython as a default execution engine with rolling apply. In pandas 1.0, we can specify Numba as an execution engine and get a decent speedup.

There are a few things to note:

  • the Numba dependency needs to be installed: pip install numba,
  • the first run of a function with the Numba engine will be slow because Numba has to compile it. However, rolling objects cache the compiled function, so subsequent calls will be fast,
  • the Numba engine pays off with a larger number of data points (e.g. 1+ million),
  • the raw argument needs to be set to True, which means that the function will receive NumPy arrays instead of pandas Series to achieve better performance.
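To illustrate the last point, here is a small sketch of what the raw argument changes. It uses the default engine, so Numba is not required. The names record_type and seen_types are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

seen_types = {"raw": set(), "not_raw": set()}

def record_type(key):
    """Build a rolling function that remembers the type of each window it receives."""
    def func(window):
        seen_types[key].add(type(window).__name__)
        return window.sum()
    return func

s = pd.Series(range(10), dtype="float64")

raw_result = s.rolling(3).apply(record_type("raw"), raw=True)
series_result = s.rolling(3).apply(record_type("not_raw"), raw=False)

# With raw=True each window arrives as a NumPy array, otherwise as a pandas Series.
print(seen_types)
```

Both calls return the same numbers; only the object handed to the function differs.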

Let’s create a DataFrame with 1 million values.

df = pd.DataFrame({"col1": pd.Series(range(1_000_000))})
df.head()

some_function calculates the sum of values and adds 5.

def some_function(x):
    return np.sum(x) + 5

Let’s measure execution time with the Cython execution engine.

%%timeit
df.col1.rolling(100).apply(some_function, engine='cython', raw=True)
4.03 s ± 76.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Cython needed 4.03 seconds to calculate the function. Is Numba faster? Let’s try it.

%%timeit
df.col1.rolling(100).apply(some_function, engine='numba', raw=True)
500 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We see that Numba is about 8 times faster in this toy example.

3. New NA value

pandas 1.0 introduces a new experimental pd.NA value to represent scalar missing values.

I know what you are thinking – yet another null value? There are nan, None and NaT already!

The goal of pd.NA is to provide consistency across data types. It is currently used by the Int64, boolean and new string data types.

Let’s create a Series of integers with None.

s = pd.Series([3, 6, 9, None], dtype="Int64")
s

What surprises me is that NA == NA produces NA, while np.nan == np.nan produces False.

s.loc[3] == s.loc[3]
# output
<NA>
np.nan == np.nan
# output
False
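Since an equality comparison cannot tell us whether a value is missing, here is a minimal sketch of checking for missing values with pd.isna instead, using the same Series as above.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 6, 9, None], dtype="Int64")

# Equality with NA propagates NA, so it cannot be used to detect missing values.
# Using the result in an if statement raises a TypeError, because the truth
# value of NA is ambiguous.
comparison = s.loc[3] == s.loc[3]

# pd.isna works for both pd.NA and np.nan.
na_is_missing = pd.isna(s.loc[3])
nan_is_missing = pd.isna(np.nan)

# The same check applied element-wise to the whole Series.
missing_mask = s.isna()

print(comparison, na_is_missing, nan_is_missing)
print(missing_mask.tolist())
```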

4. New String type

pandas 1.0 finally has a dedicated (experimental) string type. Before 1.0, strings were stored as objects, so we couldn’t be sure whether a series contained only strings or was mixed with other data types, as I demonstrate below.

s = pd.Series(['an', 'ban', 'pet', 'podgan', None])
s

Storing strings as objects becomes a problem when we unintentionally mix them with integers or floats: the data type stays object.

s = pd.Series(['an', 'ban', 5, 'pet', 5.0, 'podgan', None])
s

To test the new string dtype, we need to set dtype='string'.

In pandas 1.0, the new string data type raises an exception when the series contains integers or floats. Great improvement!

s = pd.Series(['an', 'ban', 'pet', 'podgan', None], dtype='string')
s
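As a side-by-side sketch, the two Series from this section can be compared by dtype. It also shows that the usual .str methods keep working on the new type, with missing values staying missing.

```python
import pandas as pd

mixed = pd.Series(['an', 'ban', 5, 'pet', 5.0, 'podgan', None])
strings = pd.Series(['an', 'ban', 'pet', 'podgan', None], dtype='string')

# The object dtype says nothing about what the values actually are.
print(mixed.dtype)

# The dedicated string dtype documents the intent in the dtype itself.
print(strings.dtype)

# String methods are still available through the .str accessor.
lengths = strings.str.len()
print(lengths)
```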

5. Ignore index on a sorted DataFrame

When we sort a DataFrame by a certain column, each row keeps its original index label, so the index ends up out of order. Sometimes we don’t want that. In pandas 1.0, the sort_values function takes an ignore_index argument, which relabels the sorted rows 0, 1, …, n - 1.

df = pd.DataFrame({"col1": [1, 3, 5, 2, 3, 7, 1, 2]})
df.sort_values('col1')
df.sort_values('col1', ignore_index=True)
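The following sketch prints the index of both results and checks that, for this DataFrame, ignore_index=True gives the same result as the older two-step idiom of sorting and then calling reset_index(drop=True).

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 3, 5, 2, 3, 7, 1, 2]})

kept = df.sort_values('col1')                      # rows keep their original labels
reset = df.sort_values('col1', ignore_index=True)  # labels are replaced by 0..n-1

print(kept.index.tolist())
print(reset.index.tolist())

# Compare with sorting first and resetting the index afterwards.
same_as_reset_index = reset.equals(kept.reset_index(drop=True))
print(same_as_reset_index)
```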

Conclusion

These were the 5 most interesting pandas 1.0 features in my opinion. In the long term, the new NA for missing values could bring a lot of clarity to pandas, e.g. in how functions handle missing values and whether they skip them or not.

There is also a change in deprecation policy:

  • deprecations will be introduced in minor releases (e.g. 1.1.0),
  • deprecations and API-breaking changes will be enforced in major releases (e.g. 2.0.0).

Should we upgrade or stay with the current pandas version?

The new deprecation policy makes the question "should I update pandas?" easier to answer. It also seems that we can expect more frequent major releases in the future. To learn more about new features in pandas 1.0, read What’s new in 1.0.0.

