The world’s leading publication for data science, AI, and ML professionals.

Feature Stores need an HTAP Database


Image via Cybrain/Adobe Stock under license to Zer0 to 5ive

A Feature Store is a collection of organized and curated features used for training and serving Machine Learning models. Keeping features up to date, serving feature vectors, and creating training data sets requires a combination of transactional (OLTP) and analytical (OLAP) database processing. This kind of mixed-workload database is called HTAP, for hybrid transactional/analytical processing.

The most useful Feature Stores incorporate data pipelines that continuously keep their features up to date through either batch or real-time processing that matches the cadence of the source data. Because these features are always up to date, they provide an ideal source of feature vectors for inferencing. Feature vectors can even be delivered in real time from the Feature Store. A complete Feature Store also keeps a history of feature values and uses it to create time-accurate training data sets.

A production Feature Store that keeps feature values up to date enables ML models to be moved into production quickly. Typically, features are grouped based on common context (e.g., customer, product, location) and data cadence. Data pipelines that feed each group at the appropriate cadence keep features up to date. With curated features readily available, many inference engines can access them without any additional effort.
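To make the grouping concrete, here is a minimal sketch of feature-group metadata organized by entity context and cadence. All names here are illustrative assumptions, not any specific product's API:

```python
from dataclasses import dataclass

# Illustrative feature-group metadata; field and group names are assumptions.
@dataclass
class FeatureGroup:
    name: str
    entity: str        # common context, e.g. customer, product, location
    cadence: str       # how often the pipeline refreshes this group
    features: list

groups = [
    FeatureGroup("customer_rfm", "customer", "daily",
                 ["recency_days", "frequency_30d", "monetary_90d"]),
    FeatureGroup("session_activity", "customer", "real-time",
                 ["clicks_last_hour", "last_page_viewed"]),
]

# An inference engine can discover all curated features for an entity
# without any additional effort.
customer_features = [f for g in groups if g.entity == "customer"
                     for f in g.features]
print(customer_features)
```

Each group's cadence tells the platform which pipeline (batch or real-time) is responsible for keeping it fresh.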

Feature Store Data Flow

In the diagram above, we use orange to indicate the low-latency, high-concurrency processing that is characteristic of OLTP databases. We use blue to indicate the high-volume data processing found in massively parallel processing (MPP) database engines, usually referred to as OLAP. Let's examine why both are needed to run a Feature Store at its full potential.

Batch data pipelines run periodically (typically daily or weekly). They process large amounts of source data by extracting, loading, cleansing, aggregating, and otherwise curating data into usable features. Building RFM customer profiles is a common example. Customer activity is used to calculate recency, frequency, and monetary metrics over multiple categories and multiple moving windows of time, which could include the last 24 hours, last week, last month, and last year. Transforming this data for a large number of customers and transactions usually requires parallel processing that can scale as business growth leads to increased amounts of data.
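The RFM computation itself is straightforward; a toy sketch over an in-memory transaction log (customer IDs, dates, and amounts are all made up) looks like this:

```python
from datetime import datetime, timedelta

# Hypothetical transaction log: (customer_id, timestamp, amount).
now = datetime(2024, 1, 31)
transactions = [
    ("c1", datetime(2024, 1, 30), 40.0),
    ("c1", datetime(2024, 1, 10), 25.0),
    ("c1", datetime(2023, 12, 5), 60.0),
    ("c2", datetime(2024, 1, 29), 15.0),
]

def rfm(customer_id, window_days):
    """Recency (days since last purchase), frequency, and monetary total,
    restricted to a trailing window of window_days."""
    cutoff = now - timedelta(days=window_days)
    rows = [(ts, amt) for cid, ts, amt in transactions
            if cid == customer_id and ts >= cutoff]
    if not rows:
        return None
    recency = (now - max(ts for ts, _ in rows)).days
    return {"recency_days": recency,
            "frequency": len(rows),
            "monetary": sum(amt for _, amt in rows)}

# One customer, two moving windows (last week vs. last 90 days).
print(rfm("c1", 7))   # {'recency_days': 1, 'frequency': 1, 'monetary': 40.0}
print(rfm("c1", 90))  # {'recency_days': 1, 'frequency': 3, 'monetary': 125.0}
```

At production scale, the same logic runs as a parallel aggregation over millions of customers, which is exactly the scan-heavy workload an OLAP engine is built for.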

Training data sets are created by scanning and joining large datasets that hold a history of features and events of interest. The joins span multiple feature groups and accurately bind each training case to the feature values that were in effect at the time of each event. This creates complex join conditions best suited for MPP/OLAP processing.
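This time-accurate binding is often called a point-in-time (or "as-of") join: for each labeled event, pick the most recent feature value whose effective time does not exceed the event time. A minimal sketch with fabricated timestamps and values:

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history: per entity, (effective_time, value), sorted.
feature_history = {
    "c1": [(datetime(2024, 1, 1), 0.2),
           (datetime(2024, 1, 15), 0.7)],
}

# Labeled events to build training rows for: (entity, event_time, label).
events = [("c1", datetime(2024, 1, 10), 1),
          ("c1", datetime(2024, 1, 20), 0)]

def as_of(entity, ts):
    """Return the feature value in effect at time ts (point-in-time join)."""
    hist = feature_history[entity]
    i = bisect_right([t for t, _ in hist], ts)
    return hist[i - 1][1] if i else None

training = [(as_of(entity, ts), label) for entity, ts, label in events]
print(training)  # [(0.2, 1), (0.7, 0)]
```

Note that the January 10 event correctly picks up the older value 0.2, avoiding the label leakage that a naive join on the latest value would introduce. An OLAP engine does the equivalent with windowed or inequality joins across feature groups.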

On the other hand, real-time data pipelines are needed to drive ML models that react directly to end-user interactions in real time. An example is a product recommendation engine or a next-best-action model, where user activity directly affects the feature values used by the inference. In such cases, the source of the feature needs to be connected to the Feature Store in real time, either through streaming or through direct database insert or update operations. As these messages/transactions are processed, they generate new feature values just in time for a subsequent inference that reads them. This kind of operation typically affects a small number of rows at a time but has the potential for very high concurrency.
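The per-event write is typically a single-row upsert. A sketch using an in-memory SQLite table as a stand-in for the OLTP side of the Feature Store (table and column names are illustrative):

```python
import sqlite3

# In-memory stand-in for an OLTP feature table; names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE clicks_recent (
                  customer_id TEXT PRIMARY KEY,
                  click_count INTEGER NOT NULL)""")

def record_click(customer_id):
    """Single-row upsert: the low-latency, high-concurrency write a
    real-time pipeline issues as each event arrives."""
    db.execute("""INSERT INTO clicks_recent (customer_id, click_count)
                  VALUES (?, 1)
                  ON CONFLICT(customer_id)
                  DO UPDATE SET click_count = click_count + 1""",
               (customer_id,))
    db.commit()

for _ in range(3):
    record_click("c42")

# The very next inference can read the freshly updated feature value.
count = db.execute("SELECT click_count FROM clicks_recent "
                   "WHERE customer_id = ?", ("c42",)).fetchone()[0]
print(count)  # 3
```

Each upsert touches one row, so throughput scales with concurrency rather than data volume, which is precisely the OLTP profile described above.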

In many cases, a user interaction may involve multiple feature sets. It's important that, when read for inference, all feature values are consistent with the user's recent actions. This drives the need for ACID properties in the database engine that guarantee such consistency. These high-concurrency, low-latency workloads are best addressed by OLTP database engines that can scale as user and business activity grow.
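Concretely, the inference-time read should assemble the feature vector inside a single transaction, so it sees one consistent snapshot rather than a mix of old and new values. A sketch, again with SQLite as a stand-in and illustrative table names:

```python
import sqlite3

# Autocommit mode so we can manage the transaction explicitly.
db = sqlite3.connect(":memory:", isolation_level=None)
db.executescript("""
    CREATE TABLE profile_features (customer_id TEXT PRIMARY KEY, ltv REAL);
    CREATE TABLE session_features (customer_id TEXT PRIMARY KEY, clicks INTEGER);
    INSERT INTO profile_features VALUES ('c1', 120.0);
    INSERT INTO session_features VALUES ('c1', 5);
""")

def feature_vector(customer_id):
    """Read two feature groups inside one transaction so the vector
    reflects a single consistent snapshot."""
    db.execute("BEGIN")
    try:
        ltv = db.execute(
            "SELECT ltv FROM profile_features WHERE customer_id = ?",
            (customer_id,)).fetchone()[0]
        clicks = db.execute(
            "SELECT clicks FROM session_features WHERE customer_id = ?",
            (customer_id,)).fetchone()[0]
    finally:
        db.execute("COMMIT")
    return {"ltv": ltv, "clicks": clicks}

print(feature_vector("c1"))  # {'ltv': 120.0, 'clicks': 5}
```

In a distributed HTAP database the same pattern holds; the engine's ACID guarantees ensure a concurrent update to one feature group cannot bleed into a half-updated vector.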

Few databases support both kinds of workloads while also providing horizontal scalability. HTAP databases that scale will be needed as Feature Stores become a standard implementation pattern in Machine Learning platforms with real-time solutions. So now, the commercial: Splice Machine is such a database engine, delivering ACID-compliant OLTP and OLAP engines that can scale independently. Splice Machine's Feature Store and its in-database model deployment provide a perfect combination to deliver both real-time and batch inference directly on the Feature Store.
