
ML Engineering with DynamoDB

How to leverage this powerhouse NoSQL database for online inference

Why consider DynamoDB?

When working with offline, batch ML systems, SQL-based workflows against a data warehouse like Snowflake constitute the backbone of business analytics, descriptive statistics, and predictive modeling. This is an ideal state, where complex transformations can be pushed to distributed database engines and features are defined in the same language in which they’re marshaled for inference.

Avoid leaving this Eden, unless you’re pushed out by the (real) requirement for real-time inference against real-time features. If you find yourself in this new, messy world, especially one where relevant features change at high volume and you need low-latency reads and writes, you might need NoSQL.

AWS DynamoDB is an extremely scalable, managed NoSQL database. Like other NoSQL databases, it trades consistency for availability, making it an ideal data store on top of which to build a low-latency ML predictor. The goal of this post is to demonstrate some design patterns with which you can build production workflows flexible enough to handle common ML access patterns.

Image generated by author on https://imgflip.com/

Primer on indexes in DynamoDB

Key-value stores are fast because the access pattern is simple. Elements in a set are accessed by a primary key. In DynamoDB this is a combination of partition and sort keys, which must be unique per item. The partition and sort key together comprise a composite primary key and determine, respectively (and intuitively), where and in what order the data will be stored.

Local secondary indexes operate within the confines of a single partition. They’re useful in cases where you want to sort or filter a partition in multiple ways.

Global secondary indexes do not need to share the same partition key as the primary index. They can help create flexible and efficient access patterns within a single table, as we’ll see in part 2 of our ML product scenario.

GSIs are updated with eventual consistency, which means a write can succeed against DynamoDB before all indexes are updated. This is an intentional design tradeoff and part of what makes DynamoDB so powerful. When storing high-throughput data (e.g. telemetry events, high-volume transactions), writing quickly and possibly dirtily is a good tradeoff.
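
To make the primer concrete, here’s a minimal boto3 sketch of a table with a composite primary key and a local secondary index (the table, attribute, and index names are placeholders of my choosing; we’ll add a GSI in part 2):

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="purchases",
    # Partition + sort key together form the composite primary key.
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},
        {"AttributeName": "purchase_id", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "purchase_id", "AttributeType": "S"},
        {"AttributeName": "purchase_date", "AttributeType": "S"},
    ],
    # LSI: same partition key, alternate sort key (purchase_date).
    LocalSecondaryIndexes=[
        {
            "IndexName": "purchase_date_lsi",
            "KeySchema": [
                {"AttributeName": "customer_id", "KeyType": "HASH"},
                {"AttributeName": "purchase_date", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)
```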

Inference for a single customer using a time series feature

Imagine you have a model which predicts whether a customer will make a purchase given their last 30 days of history. Your product manager asks you to use this model to determine whether a promotional banner should be displayed on the website during a browsing session (why not; I’m not a PM).

How can DynamoDB support this prediction pipeline?

  1. We design a table that supports efficient access to the features we need per customer.
  2. We write purchases in real time (likely through some queue service) to the table, as sketched below.
  3. An inference application queries this table upon receiving a request from the website.
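
For step 2, a queue consumer might persist each purchase event with a single PutItem call. A minimal sketch, assuming the table from the primer and a hypothetical event shape:

```python
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("purchases")

def handle_purchase_event(event: dict) -> None:
    """Persist one purchase event (e.g. pulled off SQS or Kinesis)."""
    table.put_item(
        Item={
            "customer_id": event["customer_id"],      # partition key
            "purchase_id": event["purchase_id"],      # sort key
            "purchase_date": event["purchase_date"],  # ISO-8601; LSI sort key
            "category": event["category"],
            # boto3's resource layer requires Decimal, not float, for numbers.
            "amount": Decimal(str(event["amount"])),
        }
    )
```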

To start, let’s illustrate what the data and feature engineering would look like in a more familiar relational model. At inference time, the model needs to access the past 30 days of purchases for the customer.
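
In Snowflake-flavored SQL, the feature might look something like this (table and column names are hypothetical):

```sql
-- One row per purchase; compute the trailing 30-day volume for a customer.
SELECT
    customer_id,
    SUM(amount) AS total_purchase_volume_30d
FROM purchases
WHERE customer_id = :customer_id
  AND purchase_date >= DATEADD(day, -30, CURRENT_DATE)
GROUP BY customer_id;
```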

How can you efficiently replicate this setup in DynamoDB? The model predicts at the customer level, using features for that customer, so a natural (and efficient) place to start is to partition the data on customer_id. DynamoDB will store the data for each customer ID contiguously; within each primary key partition, we want to efficiently fetch the purchases relevant to our prediction. A local secondary index allows us to do just that.
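
With the table and LSI sketched in the primer, the 30-day fetch might look like this (names carried over from that hypothetical schema):

```python
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("purchases")

def fetch_recent_purchases(customer_id: str, days: int = 30) -> list[dict]:
    """Query the LSI: same partition (customer_id), sorted by purchase_date."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    response = table.query(
        IndexName="purchase_date_lsi",
        KeyConditionExpression=(
            Key("customer_id").eq(customer_id) & Key("purchase_date").gte(cutoff)
        ),
    )
    return response["Items"]
```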

Let’s use the values in this table in a real-time inference application. The mocked-out model here is a logistic regression with the single, self-explanatory feature total_purchase_volume_30d. We serve the prediction to the front end through a FastAPI endpoint, which marshals the data from DynamoDB and feeds it into a predictive model.
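
A minimal sketch of such an endpoint, with made-up coefficients standing in for a trained model and reusing fetch_recent_purchases from the sketch above:

```python
import math

from fastapi import FastAPI

app = FastAPI()

# Hypothetical coefficients for a pre-trained logistic regression over the
# single feature total_purchase_volume_30d.
INTERCEPT, COEF = -2.0, 0.01

@app.get("/predict/{customer_id}")
def predict(customer_id: str) -> dict:
    purchases = fetch_recent_purchases(customer_id)  # LSI query from above
    total_purchase_volume_30d = float(sum(p["amount"] for p in purchases))
    logit = INTERCEPT + COEF * total_purchase_volume_30d
    return {
        "customer_id": customer_id,
        "purchase_probability": 1.0 / (1.0 + math.exp(-logit)),
    }
```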

Extending this to multiple predictions at once

So far so good. We’re able to hit a single API endpoint with a customer ID and access that customer’s prediction with low latency. In fact, your PM is so pleased with the performance of the promotion, she wants to expand it to an email blast. Each blast should target customers who are likely to purchase in the next 30 days, segmented by purchase category: one email blast may target jewelry shoppers, another funky shirts (winging it, here).

The ML pipeline now has to support returning predictions for all customers who have purchased in each category. The problem is there are a ton of customers; let’s say tens of millions. Furthermore, the majority haven’t made a purchase in the last 30 days. Our current data design necessitates a scan over all customers, filtering for transactions in the category we care about. This sort of scan on a large table is exorbitantly expensive, as it requires reading every item in the table, and exorbitantly annoying, as you’ll need to paginate through tons of results.

Luckily, we can extend the table with a Global Secondary Index (GSI) to support this use case efficiently. GSI data is stored independently from the primary index and, therefore, does not need to share the same partition key. This is key to supporting our use case: our strategy is to query for all items in the table under a particular category, extract the relevant customer IDs, and then leverage the primary index to make our predictions. Note: we leverage the "ProjectionType" attribute to limit the amount of data stored in the secondary index (we only need customer IDs in this scenario). This keeps costs down and speeds up reads and writes.
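
Adding such an index to the existing table might look like this (the index name and the KEYS_ONLY projection choice are my assumptions):

```python
import boto3

client = boto3.client("dynamodb")

# Backfill a category GSI onto the table. KEYS_ONLY projects only the index
# key (category) plus the table's primary key (customer_id, purchase_id),
# keeping the index small, cheap, and fast to replicate into.
client.update_table(
    TableName="purchases",
    AttributeDefinitions=[
        {"AttributeName": "category", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "category_gsi",
                "KeySchema": [{"AttributeName": "category", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
            }
        }
    ],
)
```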

The prediction loop below implements the strategy outlined above. A simple query against the GSI quickly returns all purchases in the table in the category specified by the API parameter category_name. We deduplicate the customer IDs returned and then leverage them in a BatchGetItem request (which cuts down on network round trip time) to get all purchases for those customers, using the primary key and local secondary index. The feature marshaling and prediction can then take place entirely in memory.
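
A hedged reconstruction of that loop, carrying over the hypothetical names and coefficients from the earlier sketches (unprocessed-key retries and a per-customer 30-day LSI re-query are omitted for brevity):

```python
import math

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("purchases")

INTERCEPT, COEF = -2.0, 0.01  # hypothetical pre-trained coefficients

def predict_for_category(category_name: str) -> dict[str, float]:
    # 1. Query the GSI for every purchase in the category, paginating as
    #    needed. KEYS_ONLY means each item carries just the table/index keys.
    query_kwargs = {
        "IndexName": "category_gsi",
        "KeyConditionExpression": Key("category").eq(category_name),
    }
    gsi_items = []
    while True:
        page = table.query(**query_kwargs)
        gsi_items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        query_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    # 2. Deduplicate the primary keys (and with them, the customer IDs).
    keys = [
        {"customer_id": cid, "purchase_id": pid}
        for cid, pid in {(i["customer_id"], i["purchase_id"]) for i in gsi_items}
    ]

    # 3. BatchGetItem the full purchase records, 100 keys per request
    #    (the BatchGetItem limit), cutting down on network round trips.
    purchases = []
    for start in range(0, len(keys), 100):
        response = dynamodb.batch_get_item(
            RequestItems={"purchases": {"Keys": keys[start : start + 100]}}
        )
        purchases.extend(response["Responses"]["purchases"])

    # 4. Marshal features and predict entirely in memory.
    volume_by_customer: dict[str, float] = {}
    for p in purchases:
        cid = p["customer_id"]
        volume_by_customer[cid] = volume_by_customer.get(cid, 0.0) + float(p["amount"])
    return {
        cid: 1.0 / (1.0 + math.exp(-(INTERCEPT + COEF * volume)))
        for cid, volume in volume_by_customer.items()
    }
```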

Conclusion

Through two ML product use-cases, we’ve seen how to leverage local and global indexes in DynamoDB to efficiently store and access feature data. You can even get fancier, simulating a full set of relations via secondary indexes.

Feature engineering in SQL is more intuitive and less verbose, so use it where you can. But DynamoDB’s low-latency, eventually consistent model makes it an attractive option for high-volume online inference.

