Feature Engineering for Time Series Data

Time series aggregations on the modern data stack

Andrew Engel
Towards Data Science



In many business data science problems, data with a time series character (e.g., transactions, sensor readings, etc.) must be aggregated to an individual (e.g., customer, piece of equipment, etc.) level. Doing this in a modern machine learning environment can lead to trouble as traditional train-test and cross-validation splits will fail. This is because randomly splitting observations between training and test sets will almost always split related (and hence correlated) observations between the two sets. More problematically, observations will be split so that some training observations can be taken after the test observations. In other words, the model will be learning to predict the past from the future. This often leads to models failing when placed into production.
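One simple guard against this leakage is to split on time rather than at random: pick a cutoff date, train only on observations before it, and test on observations after it. A minimal sketch with toy data (the column names here are assumptions chosen to match the weather example later in this post):

import pandas as pd

# Toy observation-level data; FIPS and DATE mirror the columns used later
df = pd.DataFrame({
    'FIPS': ['01001', '01001', '01003', '01003'],
    'DATE': pd.to_datetime(['2022-01-01', '2022-07-01',
                            '2022-01-01', '2022-07-01']),
    'HIGH_TEMP': [55, 90, 58, 92],
})

# Train on everything strictly before the cutoff, test on the rest,
# so the model never sees the future during training
cutoff = pd.Timestamp('2022-06-01')
train = df[df['DATE'] < cutoff]
test = df[df['DATE'] >= cutoff]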

There are numerous ways to deal with this issue, and I discussed my preferred method here. A commonly used method is to take a single observation from each individual. This can be done either by randomly selecting one observation for each individual or picking a date and using only the observations from this date.
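Either variant is a one-liner in pandas. A small sketch with toy data (column names are assumptions matching the weather example below):

import pandas as pd

# Toy frame standing in for the observation-level data
df = pd.DataFrame({
    'FIPS': ['01001', '01001', '01003', '01003'],
    'DATE': pd.to_datetime(['2022-01-01', '2022-01-02',
                            '2022-01-01', '2022-01-02']),
    'HIGH_TEMP': [55, 60, 58, 62],
})

# Option 1: randomly select one observation per individual
one_per_county = df.groupby('FIPS').sample(n=1, random_state=0)

# Option 2: keep only the observations from a chosen date
snapshot = df[df['DATE'] == pd.Timestamp('2022-01-02')]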

We will focus on creating a single record for each individual on a given date. In this example, we will use weather data and aggregate it to show the average high temperature at the county level over multiple periods of time. This can be challenging to do in pandas, but we will show how easily it can be done on the modern data stack with Rasgo.

Single moment aggregations in pandas

Given a dataframe df that contains a county code FIPS, a date DATE, and the high temperature in a county on a date HIGH_TEMP, rolling and groupby can be used to create rolling averages for all days. In this case, rolling averages over one, two, and four weeks will be created.

First, the rolling windows should be created. Note that the grouped rolling result comes back indexed by county and date, so the dataframe is sorted first and the raw values are assigned back to keep the rows aligned:

df = df.sort_values(by=['FIPS', 'DATE']).set_index('DATE')
df['HIGH_TEMP_AVG_7D'] = (
    df.groupby('FIPS')['HIGH_TEMP']
      .rolling('7D', closed='both')
      .mean()
      .values
)
df['HIGH_TEMP_AVG_14D'] = (
    df.groupby('FIPS')['HIGH_TEMP']
      .rolling('14D', closed='both')
      .mean()
      .values
)
df['HIGH_TEMP_AVG_28D'] = (
    df.groupby('FIPS')['HIGH_TEMP']
      .rolling('28D', closed='both')
      .mean()
      .values
)

The index can be reset to return the date to being a simple column in the dataframe.

df.reset_index(inplace=True)

This approach has several issues. First, if the data is large, it can be quite time-consuming to download it to the workstation and process it. While this could be sped up by only downloading the data needed for the dates in question, additional computations would still be performed for all dates in the data. More importantly, a production workflow only needs the aggregation as of the current date, yet this approach computes it for every date.
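To make the second issue concrete: in pandas, every date's rolling average is computed before the production snapshot can be filtered out. A sketch with toy data (column names assumed from the example above):

import pandas as pd

# Toy data standing in for the weather table
df = pd.DataFrame({
    'FIPS': ['01001'] * 3 + ['01003'] * 3,
    'DATE': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03'] * 2),
    'HIGH_TEMP': [50.0, 54.0, 58.0, 60.0, 62.0, 64.0],
})

# Compute rolling averages for every date, as above
df = df.sort_values(by=['FIPS', 'DATE']).set_index('DATE')
df['HIGH_TEMP_AVG_7D'] = (
    df.groupby('FIPS')['HIGH_TEMP']
      .rolling('7D', closed='both')
      .mean()
      .values
)
df.reset_index(inplace=True)

# Production scoring only needs the most recent date's rows,
# but the work above was done for all dates anyway
snapshot_df = df[df['DATE'] == df['DATE'].max()]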

RasgoQL and the modern data stack approach

On the modern data stack, the open-source package RasgoQL resolves both of these problems. First, the processing is kept in the cloud data warehouse, so no data is moved to a workstation. Second, the RasgoQL transform timeseries_agg leverages the power of the data warehouse to work with much larger amounts of data than would be possible on a workstation.

To start, get a reference to the table in the data warehouse that holds the weather data.

dataset = rql.dataset('WEATHER_TABLE_NAME')

The rolling averages for every day can be generated with a call to the transform.

tsagg = dataset.timeseries_agg(
    aggregations={'DAILY_HIGH_TEMP': ['AVG']},
    group_by=['FIPS'],
    date='DATE',
    offsets=[7, 14, 28],
    date_part='day')

This data can be downloaded by calling to_df on tsagg to get a pandas dataframe for additional processing.

tsagg_df = tsagg.to_df()

Alternatively, this data can be published back to the cloud data warehouse to create a view that can be used by others both within RasgoQL and straight from the data warehouse through SQL.

tsagg.save(table_name="ROLLING_WEATHER_SNAPSHOT")

Using transformations from the open-source Python package RasgoQL on the modern data stack makes creating these sorts of rolling windows much simpler than in pandas. The package generates a SQL call that is executed in the database, and the results can be saved back to the database for easy reuse within the existing project or in other projects that need the same data. Because the work happens in SQL, much larger volumes of data can be processed. Finally, while pandas' aggregate function allows multiple aggregations on multiple columns simultaneously, only one time window (e.g., seven or fourteen days) can be used per call. This means that creating seven-, fourteen-, and twenty-eight-day averages requires three separate Python statements. In RasgoQL, a single call generates all three at once.
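The pandas limitation is worth seeing directly: a single rolling call can apply several aggregation functions at once, but still only over one window. A sketch with toy data (column names assumed from the earlier example):

import pandas as pd

# Toy frame with one county's daily high temperatures
df = pd.DataFrame({
    'FIPS': ['01001'] * 3,
    'DATE': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03']),
    'HIGH_TEMP': [50.0, 54.0, 58.0],
})

# Several aggregations in one call, but only one time window (7 days);
# 14- and 28-day windows would each need another statement like this
rolled = (
    df.sort_values(by=['FIPS', 'DATE'])
      .set_index('DATE')
      .groupby('FIPS')[['HIGH_TEMP']]
      .rolling('7D', closed='both')
      .agg(['mean', 'max'])
)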

If you want to check out RasgoQL, the documentation can be found here and the repository here.


Head of Customer-Facing Data Science at Weav AI. Developer of pgetu to bring tsfresh to Postgres. Previously at DataRobot, SAS, HP.