Improve Your Data Science Workflow with Rolling Functions in Pandas

A guide to Rolling Features in Pandas

Avi Chawla

Follow

Published in

Towards Data Science

5 min readJul 27, 2022

--

Much of the tabular data analysis we see today is driven by popular Pandas’ series-based methods, which take into account the entire data at once for analysis. These methods usually encompass evaluating series distribution using value_counts(), determining the unique values using unique(), finding the distribution of one column which is segregated based on the values in another column using groupby(), or generating a cross-tabulation of values from multiple columns using crosstab() etc.

However, another robust set of feature engineering methods often overlooked by Data Scientists is window-based data analysis. Here, instead of estimating a particular data statistic/measure on the entire data column, a pre-defined number of contiguous records (or a window of values) are considered for evaluation.

For instance, consider the dummy sales data below. We can compute the average three-day rolling sales for every sale day by aggregating three records simultaneously, as shown in the image.

A dummy dataset for store sales (Image by Author)

The mathematical formula to compute the three-day rolling average is depicted in the image below.

Formula to compute the three-day rolling average (Image by author)

This family of feature generation methods is especially pertinent to time-series data. Finding rolling average, rolling sum, rolling minimum, and rolling maximum are some of the most common applications of window-based feature engineering methods.

This post serves as an introductory guide to rolling features in Pandas. I’ll demonstrate the methods to compute rolling features with the help of a dummy dataframe.

You can find the notebook for this article here.

Let’s begin 🚀!

Dataset

For the purpose of this article, I have created a dummy daily store sales dataset of 7 days, starting from 7th Jan 2020 to 13th Jan 2020. Corresponding to each date, we know whether it is a weekend or a weekday. Moreover, I generated a random sale number corresponding to both cash and card modes of payment. This is implemented below.

The first five rows of the dummy sales dataset are shown in the image below:

The first five rows of the DataFrame (Image by author)

Rolling Functions

As elaborated above, window functions consider a subset of rows at a time to estimate a statistic/measure on the given data. In Pandas, this family of functions is accessible using the df.rolling() class. The type of python object created can be seen as follows:

Rolling Sum

As the name indicates, the rolling sum approach computes the moving sum over the pre-defined number of contiguous records. This is elaborated on the dummy dataset below:

With window=2, we intend to sum the “Sales” column over 2 contiguous records. Also, note that the first value of the rolling sum column is NaN.

If you want to consider only the first record for the computation of the rolling sum on the first record, pass min_periods=1 in the definition of rolling class as shown below:

Result of the Rolling sum over a window size of 2 with min_periods (Image by author)

The min_periods argument specifies the minimum number of observations in the current window required to generate a rolling value; otherwise, the result is NaN.

Rolling Sum Over Days

Next, let’s say you want to compute the rolling sum over the last two days instead of the previous two records as discussed above.

Obtaining the rolling sum over days instead of records is extremely relevant in this situation. This is because we have multiple records corresponding to the same date. Therefore, if you want to find the sales over a multiple-day window, you can do so as follows:

Result of the Rolling sum over two days (Image by author)

The argument window=’2d’ is interpreted as “2 days” by the rolling class. Similarly, if you want to obtain the rolling sum over 2 seconds, you can specify window=’2s’.

Here, if you just want to filter the last record corresponding to each date, you can do that using the groupby() method as follows:

Filtering the rolling sum to obtain the last record for each day (Image by author)

Grouped Rolling

Lastly, say you want to generate the rolling sum separately for the card modes of payment and cash. In other words, the rolling sum is expected to be grouped on a column. This is also achievable by adding a layer of groupby() to the DataFrame and calculating rolling features on the individual groups. This is demonstrated below:

Result of the Rolling sum over a window size of 2 grouped on Payment_type (Image by author)

Additional Rolling Methods

Similar to the sum-based window operation discussed above, Pandas allows you to perform various other rolling functions as well. These include rolling mean, standard deviation, minimum and maximum, etc. You can find all the supported methods here. Let’s discuss one of them below.

Rolling Mean

Rolling Mean, also known as moving average, calculates the mean of the values within the window of values. Moving average is a popular data analysis technique used in Data Science for time-series data.

In terms of the syntax in Pandas, everything we discussed above remains the same, except that we replace sum() with mean() to find the moving average. This is demonstrated below.

Note: To demonstrate the moving average, I created three months of dummy data, starting from 1st Jan 2020 to 31st Mar 2020.

Result of the moving average over a window size of 6 with min_periods = 1 (Image by author)

You can plot the moving average using Matplotlib as shown below:

Line chart depicting the moving average and the actual values (Image by author)

To conclude, in this post, we discussed rolling features in Pandas and how you can use them in your next tabular data analysis for time-series analysis.

To know about more methods available within the “Window” module of Pandas, I would highly recommend checking the official documentation of Pandas here.

As always, thanks for reading. Mentioned below are some of my other articles which you may enjoy reading next :)

Improve Your Data Science Workflow with Rolling Functions in Pandas

A guide to Rolling Features in Pandas

Dataset

Rolling Functions

Rolling Sum

Rolling Sum Over Days

Grouped Rolling

Additional Rolling Methods

Rolling Mean

20% of NumPy Functions that Data Scientists use 80% of the Time

Who said you should know everything?

Five Killer Optimization Techniques Every Pandas User Should Know

A step towards data analysis run-time optimization

Written by Avi Chawla