Q&A for Ingesting Historical Data into SageMaker Feature Store

Answering questions on how to ingest data into SageMaker Offline Feature Store by directly writing into S3

Heiko Hotz
Towards Data Science

In a previous blog post I showed how to ingest data into SageMaker Offline Feature Store by writing directly into S3. I have received feedback and suggestions for advanced scenarios which I will discuss in this Q&A.

Q: How can I ingest historical data where feature records have different timestamps?

I simplified my previous example by assigning the same timestamp to every feature record. In a real-world scenario, however, it is much more likely that historical feature records have different timestamps. In that case we can use the same approach, but we have to be a bit more sophisticated when setting up the S3 folder structure, and we have to split the dataset according to those timestamps.

Let’s start by creating a feature dataset with different timestamps per record:
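A rough sketch of what this could look like. The transaction_id and amount columns are illustrative placeholders, not necessarily the exact columns from the accompanying notebook:

import numpy as np
import pandas as pd

# Illustrative feature data: one record per transaction
df = pd.DataFrame({
    "transaction_id": range(2000),
    "amount": np.round(np.random.rand(2000) * 100, 2),
})

# Random event timestamps (epoch seconds) between 1 Jan 2021, 8pm and 2 Jan 2021, 10am
start = int(pd.Timestamp("2021-01-01 20:00:00").timestamp())
end = int(pd.Timestamp("2021-01-02 10:00:00").timestamp())
df["event_time"] = np.random.randint(start, end, size=len(df)).astype(float)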

This code appends random timestamps between 1 Jan 2021, 8pm and 2 Jan 2021, 10am to the dataset.

The documentation on the S3 folder structure for the Offline Store tells us that we have to create a separate folder for each unique combination of year, month, day, and hour of those timestamps. It also tells us that the filename for each feature subset must start with the latest event timestamp contained in the file:

Naming convention for S3 (Image by author)

To accomplish this we need to create a key for each record in the dataset. This key will be in the format YYYY-MM-DD-HH, representing the year, month, day, and hour of the record's timestamp. We then group together all feature records that share the same key:

Splitting feature data according to the timestamps (Image by author)

For each subset we also need to determine the corresponding filename. To do so, we need to identify the latest timestamp within each subset. In the example shown above, the filename for the subset with key 2021-01-01-22 would start with "20210101224453_", since the latest entry in this subset is from 22:44:53.

The following code generates the keys as well as S3 paths and filenames for each subset:
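A sketch of how this could look. The bucket, prefix, account_id, region, and feature_group_name values are placeholders, and the year=/month=/day=/hour= path layout follows the documented offline store partitioning; verify the exact base path against your own feature group in S3:

import secrets

# Placeholders -- replace with your own values (the offline store folder name also
# contains the feature group creation timestamp, so check the actual path in S3)
bucket, prefix = "my-feature-store-bucket", "my-prefix"
account_id, region = "123456789012", "us-east-1"
feature_group_name = "transactions-feature-group-<creation-timestamp>"

base_uri = (f"s3://{bucket}/{prefix}/{account_id}/sagemaker/{region}/"
            f"offline-store/{feature_group_name}/data")

# Key in the format YYYY-MM-DD-HH for every record
dt = pd.to_datetime(df["event_time"], unit="s", utc=True)
df["key"] = dt.dt.strftime("%Y-%m-%d-%H")

# For each key: the hourly partition path and a filename that starts with the
# latest event timestamp contained in that subset
paths, filenames = {}, {}
for key, subset in df.groupby("key"):
    year, month, day, hour = key.split("-")
    paths[key] = f"{base_uri}/year={year}/month={month}/day={day}/hour={hour}"
    latest = pd.to_datetime(subset["event_time"].max(), unit="s", utc=True)
    filenames[key] = f"{latest.strftime('%Y%m%d%H%M%S')}_{secrets.token_hex(8)}.parquet"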

To split the dataset according to these timestamp keys and save each subset to its corresponding S3 path, we can simply leverage the groupby() method of pandas:
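Roughly, and reusing the paths and filenames dictionaries from the sketch above. Writing parquet directly to an s3:// URI via pandas assumes s3fs is installed (alternatively, write locally and upload with boto3), and the dataframe is assumed to already contain any additional columns the offline store schema expects, as described in the previous post:

for key, subset in df.groupby("key"):
    # Write each hourly subset as a parquet file into its offline store partition
    subset.drop(columns=["key"]).to_parquet(f"{paths[key]}/{filenames[key]}", index=False)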

Conclusion

In this example we have ingested historical feature records with different timestamps into SageMaker Offline Feature Store. We have split the data according to the feature record timestamps, created S3 paths according to the documentation, and stored each subset in its corresponding S3 location. A complete example with code can be found in this notebook.

Q: I have several versions for each feature record. I want to ingest the historical feature records into the Offline Store, but also have the latest version of each feature record synced with the Online Store. How do I do that?

In this scenario we will have to identify the latest version of each feature record based on its event timestamp. We will then backfill all versions older than the latest one by writing those records directly into S3, and ingest the subset with the latest records using the regular ingestion API. In the end we will have the latest version of each record in both the online and the offline store, while all other (historical) versions will be available in the offline store only.

Let’s start by creating a dataset to reflect this scenario. The code below creates 3 records per transaction, each with a different timestamp:
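A possible sketch of such a dataset, again with illustrative column names (2,000 transactions with three versions each gives the 6,000 records mentioned below):

import numpy as np
import pandas as pd

base = int(pd.Timestamp("2021-01-01 00:00:00").timestamp())

records = []
for tx_id in range(2000):
    for version in range(3):
        records.append({
            "transaction_id": tx_id,
            "amount": round(float(np.random.rand() * 100), 2),
            # each version gets a later event timestamp than the previous one
            "event_time": float(base + version * 3600 + np.random.randint(0, 3600)),
        })

df = pd.DataFrame(records)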

The resulting dataset has 6,000 records, three for each transaction. Now we want to split the data into two groups:

The first subset (df_online) contains the latest version of each transaction; this is the one we will ingest via the API call. The second subset (df_offline) contains the older versions of each feature record; this one we will write directly into S3 in the same way as above.

Splitting feature data into historical and current subsets (Image by author)
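One way to implement this split with pandas, assuming the transaction_id and event_time column names from the sketch above:

# Index of the latest version per transaction
latest_idx = df.groupby("transaction_id")["event_time"].idxmax()

df_online = df.loc[latest_idx]          # latest version of each record -> ingest via the API
df_offline = df.drop(index=latest_idx)  # older versions -> write directly to S3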

Because the code for populating the offline store is the same as above, I won’t go through it again here, but you can find it in this notebook. One difference I want to point out explicitly is that when creating the feature group we need to make sure that the online store is enabled:
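With the SageMaker Python SDK this is the enable_online_store flag on FeatureGroup.create(); the feature group name, bucket, prefix, and role below are placeholders:

from sagemaker import Session
from sagemaker.feature_store.feature_group import FeatureGroup

bucket, prefix = "my-feature-store-bucket", "my-prefix"    # placeholders
role = "arn:aws:iam::123456789012:role/my-sagemaker-role"  # placeholder

feature_group = FeatureGroup(name="transactions-feature-group", sagemaker_session=Session())
feature_group.load_feature_definitions(data_frame=df_online)

feature_group.create(
    s3_uri=f"s3://{bucket}/{prefix}",
    record_identifier_name="transaction_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,  # make sure the online store is enabled
)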

Once the S3 backfilling is complete, we can ingest the latest version of each feature record with the API. This will write the subset to the online store as well as the offline store:
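A minimal sketch of that ingestion call:

# Writes df_online to the online store and, asynchronously, to the offline store
feature_group.ingest(data_frame=df_online, max_workers=3, wait=True)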

Writing to the online store is immediate and we can test it right away by calling the GetRecord API:
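For example, with the boto3 runtime client (the feature group name and record identifier value are placeholders):

import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

response = featurestore_runtime.get_record(
    FeatureGroupName="transactions-feature-group",
    RecordIdentifierValueAsString="42",  # any transaction_id, passed as a string
)
print(response["Record"])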

Writing to the offline store via the API takes a few minutes. After waiting for ~5–10 minutes, we can test whether the offline store is now populated correctly:
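One way to check is an Athena query through the SDK's athena_query() helper (the output location and the transaction_id filter are placeholders; bucket and prefix as defined above):

query = feature_group.athena_query()

query.run(
    query_string=f'SELECT * FROM "{query.table_name}" WHERE transaction_id = 42',
    output_location=f"s3://{bucket}/{prefix}/query_results/",
)
query.wait()
print(query.as_dataframe())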

If everything worked correctly, you should see three records, two of which we have ingested by writing to S3 directly and one via the API:

Image by author

Conclusion

In this section we have backfilled historical feature records into the offline store by writing directly to S3. We have also synced the current version of the feature records with the online store by using the ingestion API. In the end we have all feature records in the offline store and the most current version in the online store. The entire code for this example can be found in this notebook.

Thanks to everyone who reached out with feedback and suggestions. And if you still have any questions or feedback, please feel free to reach out.
