Ingesting Historical Feature Data into SageMaker Feature Store

How to backfill the SageMaker Offline Feature Store by writing directly into S3

Heiko Hotz
Towards Data Science



What is this about?

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, update, retrieve, and share machine learning (ML) features. It was introduced at AWS re:Invent in December 2020 and has quickly become one of the most popular services within the SageMaker family. Many AWS customers I have spoken to since re:Invent expressed their interest in SageMaker Feature Store (SMFS).

Some of these customers have historical feature data they would like to migrate to the SMFS offline store, which holds large volumes of feature data, keeps track of historical feature values, and serves as the source for train/test datasets used in model development or by batch applications.

A major challenge when ingesting historical data into the SMFS offline store is that users are charged for using the ingestion APIs, even if they only want to backfill historical data. These charges can grow quickly for customers who have terabytes of data to migrate to SMFS.

In this blog post I show how to write historical feature data directly into S3, which is the backbone of the SMFS offline store. This approach bypasses the ingestion APIs and substantially reduces costs.

Q: I’m only here for the code, where can I find it?

A: Here you go :)

A quick back-of-the-envelope cost calculation

This notebook is a good example showing how to ingest data into SMFS using the dedicated ingestion API. According to the Amazon SageMaker Pricing website, users are charged $1.25 per million write request units (this is for the US East (N. Virginia) region; charges may differ in other regions).

Pricing for SageMaker Feature Store

One write request unit covers 1 KB of data, so each gigabyte (1 million KB) costs $1.25 to write into SMFS. I have spoken to customers with terabytes of historical feature data who just want to backfill the SMFS offline store; they would be charged thousands of dollars if they used the API for this. In contrast, putting files directly into S3 is much cheaper ($0.005 per 1,000 PUT requests in US East N. Virginia).
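As a rough illustration, here is the calculation for 1 TB of historical features (the data volume and the assumption of roughly 100 MB per Parquet file are purely illustrative numbers):

```python
# Back-of-the-envelope comparison for 1 TB of historical features (illustrative numbers).
data_kb = 1_000 * 1_000_000                   # 1 TB expressed in KB (1 GB = 1 million KB)
api_cost = data_kb / 1_000_000 * 1.25         # $1.25 per million write request units (1 unit = 1 KB)

files = 10_000                                # assuming ~100 MB per Parquet file for 1 TB of data
s3_put_cost = files / 1_000 * 0.005           # $0.005 per 1,000 S3 PUT requests

print(f"Ingestion API: ${api_cost:,.2f} vs. direct S3 upload: ${s3_put_cost:,.2f}")
# Ingestion API: $1,250.00 vs. direct S3 upload: $0.05
```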

How to write feature data directly into S3

The game plan is straightforward: we will create a feature group, amend the corresponding dataset, and save it directly in S3 according to the SMFS offline store data format. As a result, the dataset becomes available in the SMFS offline store.

Preparing the data

For this exercise I will use a synthetic dataset that represents credit card transactions. The dataset is publicly available in the SageMaker Sample S3 bucket:
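Loading it with pandas looks roughly like this; the exact S3 key below is a hypothetical placeholder and may differ from the path used in the accompanying notebook:

```python
import pandas as pd

# Hypothetical location of the sample dataset, adjust the bucket/key to the
# actual path referenced in the notebook. Reading from s3:// requires s3fs.
s3_path = "s3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions/transactions.csv"
df = pd.read_csv(s3_path)
df.head()
```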

Some of the columns are of type object and we need to convert them to strings so that SMFS accepts the data (see this page for SMFS supported data types):
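A minimal sketch of the conversion, casting every object-typed column to pandas' string type so it maps to SMFS' String feature type:

```python
# Cast all object-typed columns to strings so SMFS can map them to the String feature type.
for col in df.columns:
    if df.dtypes[col] == "object":
        df[col] = df[col].astype("str").astype("string")
```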

Each dataset in SMFS requires a timestamp for each data point. Since our dataset doesn’t have a timestamp, we need to add it manually:
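Something like the following works; the column name EventTime is an assumption, any name is fine as long as it matches the feature group definition later on:

```python
import time
import pandas as pd

# Stamp every record with the current time in seconds since the epoch.
event_time_sec = int(round(time.time()))
df["EventTime"] = pd.Series([event_time_sec] * len(df), dtype="float64")
```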

Creating a feature group

The dataset is now ready and we can create the corresponding feature group. To do so, we first define the feature group like so:
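A sketch of the definition using the SageMaker Python SDK; the feature group name is illustrative, and load_feature_definitions infers the feature names and types from the prepared dataframe:

```python
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

sagemaker_session = sagemaker.Session()
feature_group_name = "transactions-feature-group"   # illustrative name
feature_group = FeatureGroup(
    name=feature_group_name,
    sagemaker_session=sagemaker_session,
)

# Infer feature names and types from the prepared dataframe.
feature_group.load_feature_definitions(data_frame=df)
```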

Now we are ready to create the feature group. Note that creating the feature group is different from ingesting the data and will not incur charges. The feature group will be registered with SMFS, but it will remain empty until we fill it with data. We also need to provide the name of the column that uniquely identifies a record (i.e. the primary key), as well as the name of the column that contains the timestamp for each record:
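A sketch of the create call; the bucket, prefix, and record identifier column (TransactionID) are assumptions you would replace with your own values, and the online store is disabled because we only backfill the offline store:

```python
import sagemaker

role = sagemaker.get_execution_role()   # assumes this runs inside a SageMaker notebook

bucket = "my-feature-store-bucket"      # hypothetical offline store bucket
prefix = "feature-store"                # hypothetical customer prefix

feature_group.create(
    s3_uri=f"s3://{bucket}/{prefix}",
    record_identifier_name="TransactionID",   # assumed primary-key column of the dataset
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=False,
)
```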

Up until this point we have followed the same steps as we would when ingesting data into SMFS via the API. Now we will write our data directly into S3.

Writing the data into S3

The key to writing data into S3 so that it becomes available in SMFS is this documentation. We are particularly interested in the naming convention used to organise files in the offline store:

Naming convention in S3 for offline store

To reconstruct the corresponding S3 file path we only need the table name, as well as the year, month, day, and hour of the event timestamp we created earlier:
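A sketch of the path construction follows; the account ID and region are placeholders, and the exact path segments should be double-checked against the naming convention in the documentation referenced above. The table name can be read from the feature group description:

```python
from datetime import datetime, timezone

account_id = "123456789012"     # hypothetical AWS account ID
region = "us-east-1"

# The offline store table name is part of the feature group description.
table_name = feature_group.describe()["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]

event_time = datetime.fromtimestamp(event_time_sec, tz=timezone.utc)
s3_prefix = (
    f"{prefix}/{account_id}/sagemaker/{region}/offline-store/{table_name}/data/"
    f"year={event_time.year}/month={event_time.month:02d}/"
    f"day={event_time.day:02d}/hour={event_time.hour:02d}"
)
```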

It is important to note that additional fields will be created by SMFS when using the ingestion API:

Additional fields for the dataset

Because we aren’t using the API, we need to add those additional fields manually. The two timestamps (api_invocation_time and write_time) are different from the event timestamp we created earlier. However, for demonstration purposes it’s OK to reuse the same timestamp:
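A minimal sketch that reuses the event timestamp for both service-generated time columns and marks all records as not deleted:

```python
import pandas as pd

# Reuse the event timestamp for both service-generated columns (fine for a demo).
df["api_invocation_time"] = pd.to_datetime(event_time_sec, unit="s")
df["write_time"] = pd.to_datetime(event_time_sec, unit="s")
df["is_deleted"] = False
```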

To create a valid file name for the data we can concatenate the event time and a random 16-character alphanumeric code. As a final step we save the data as a Parquet file in S3:
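A sketch of this final step; writing the Parquet file straight to S3 with pandas assumes pyarrow (or fastparquet) and s3fs are installed:

```python
import random
import string

# File name: event time plus a random 16-character alphanumeric suffix.
suffix = "".join(random.choices(string.ascii_letters + string.digits, k=16))
file_name = f"{event_time_sec}_{suffix}.parquet"

# Write the dataframe directly to the offline store location in S3.
df.to_parquet(f"s3://{bucket}/{s3_prefix}/{file_name}", index=False)
```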

Checking if the ingestion was successful

The SMFS offline store is accessed via Athena queries, so the quickest way to check if the data ingestion was successful is to write an Athena SQL query that retrieves the data from the feature store:
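A sketch using the SDK's built-in Athena helper; the output location for the query results is an assumption:

```python
# Query the offline store through Athena to verify that the records are visible.
query = feature_group.athena_query()
athena_table = query.table_name

query.run(
    query_string=f'SELECT * FROM "{athena_table}" LIMIT 10',
    output_location=f"s3://{bucket}/athena-query-results/",   # assumed staging location
)
query.wait()
result_df = query.as_dataframe()
result_df.head()
```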

If the ingestion was successful, the dataset retrieved from the feature store will match the dataset we uploaded.

Conclusion

We have successfully backfilled the SageMaker Offline Feature Store with historical data without using the ingestion API. We did so by amending the dataset, setting up the appropriate S3 folder structure, and uploading the dataset directly to S3.

This methodology allows users to save the ingestion API charges, which can be substantial for large amounts of historical data.

[EDIT on 29/04/2021: A follow-up blog post discussing more advanced scenarios can be found here]
