Notes from Industry

MLOps: Building a Feature Store? Here are the top things to keep in mind

A Feature Store should have 3 major building blocks and offer 10 major functionalities. These enable the Feature Store to improve the way raw data is processed and cataloged leading to a faster turnaround time for data scientists.

Pronojit Saha
Towards Data Science
7 min read · Sep 7, 2021


by Pronojit Saha & Dr. Arnab Bose

Fig 1: Feature Store Data Flow (Image by Author)

In our first article on Feature Stores, we defined what a Feature Store is, why it is needed, and how it fills an important gap in the MLOps lifecycle. As the diagram above shows, a Feature Store has three layers: Transform (to ingest and process data and create features), Store (to store the created features and their metadata), and Serve (to make the stored features available). In this article, we look at the major components that the Store layer should have. We will also learn about the major functionalities to keep in mind while building a Feature Store or evaluating any of the existing Feature Store offerings from vendors. In the next article, we will go into how to use various technologies to implement these three layers and functionalities. Subscribe here to get notified of the same.

Feature Store: Major Components

The 3 major components that the Store layer of a feature store should have are:

  1. An offline feature store for serving large batches of features to (1) create train/test datasets and (2) feed training jobs and/or batch scoring/inference jobs. These requests are typically served at latencies of more than a minute. The offline store is generally implemented on a distributed file system (ADLS, S3, etc.) or a data warehouse (Redshift, Snowflake, BigQuery, Azure Synapse, etc.). For example, Uber's Michelangelo Palette Feature Store, which takes inspiration from the Lambda architecture, implements it using Hive on HDFS.
  2. An online feature store for serving a single row of features (a feature vector) to be used as input to an online model for an individual prediction. These requests are typically served at latencies of seconds or milliseconds. The online store is ideally implemented as a key-value store (e.g., Redis, MongoDB, HBase, Cassandra) for fast lookups, depending on the latency requirement.
  3. A Feature Registry for storing metadata of all the features with lineage to be used by both offline & online feature stores. It’s basically a utility to help understand all the features available in the store and information about how they were generated. Also, some feature search/recommendation capability for downstream apps/models may be implemented as part of a feature registry to enable easy discovery of features.
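To make the division of labor between the three components concrete, here is a minimal in-memory sketch. All names and structures are illustrative, not any vendor's API; real systems back the offline store with a warehouse or data lake, the online store with a key-value database such as Redis, and the registry with a metadata database.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    offline: list = field(default_factory=list)   # full history of feature rows, for batch jobs
    online: dict = field(default_factory=dict)    # entity_id -> latest feature vector, for fast lookups
    registry: dict = field(default_factory=dict)  # feature name -> metadata with lineage

    def register(self, name, description, source):
        # Registry: record what a feature means and where it comes from.
        self.registry[name] = {"description": description, "source": source}

    def write_batch(self, rows):
        # Offline: append full rows for training/batch scoring.
        self.offline.extend(rows)
        # Online: keep only the latest vector per entity for low-latency serving.
        for row in rows:
            self.online[row["entity_id"]] = {
                k: v for k, v in row.items() if k != "entity_id"
            }

    def get_online(self, entity_id):
        return self.online[entity_id]

store = FeatureStore()
store.register("avg_order_value", "30-day average order value", "orders_stream")
store.write_batch([{"entity_id": "u1", "avg_order_value": 42.5}])
print(store.get_online("u1"))  # {'avg_order_value': 42.5}
```

The key design point the sketch captures: a batch write lands in both stores, but the online store retains only the latest vector per entity, trading history for lookup speed.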
Fig 2: A Feature Store Framework (Image from Feature Stores for ML, 2021)

FeatureOps: Major Functionalities of a Feature Store

In our previous article, we identified FeatureOps as the gap in the present MLOps lifecycle. Without FeatureOps, data scientists duplicate work by creating the same features again and again, significantly increasing time-to-market. FeatureOps enables data scientists to carry out the functionalities listed below using a Feature Store, shortening their development time and promoting team collaboration. Let's look at those functionalities now.

Fig 3: Feature Store Functionalities (Image by Author)

1. Feature Data Validation

A feature store should support defining data validation rules for checking data sanity. These include, but are not limited to, checking that values are within a valid range, checking that values are unique or not null, and checking that descriptive statistics are within defined ranges. Examples of validation tools that can be integrated into a Feature Store are Great Expectations (any environment), TFX Data Validation (deep learning environments), and Deequ (big data environments).
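The rule types above can be sketched in a few lines of plain Python. The function names are illustrative, not any specific library's API; tools like Great Expectations express the same checks declaratively.

```python
def check_range(values, low, high):
    # Every value falls within a valid range.
    return all(low <= v <= high for v in values)

def check_not_null(values):
    # No missing values.
    return all(v is not None for v in values)

def check_unique(values):
    # No duplicate values.
    return len(values) == len(set(values))

def check_mean_within(values, low, high):
    # A descriptive statistic stays within a defined range.
    return low <= sum(values) / len(values) <= high

ages = [23, 35, 41, 29]
report = {
    "range_ok": check_range(ages, 0, 120),
    "not_null": check_not_null(ages),
    "unique": check_unique(ages),
    "mean_ok": check_mean_within(ages, 18, 80),
}
print(report)  # every check passes for this sample
```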

2. Features Joins

Data scientists need to reuse features across train/test datasets for different models, which requires joining features from different feature groups. Joins are mostly implemented in the offline store rather than the online store, as joining is an expensive operation.
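A minimal sketch of such a join, assuming two feature groups keyed by the same entity ID (the group and field names are hypothetical):

```python
# Two independently maintained feature groups, keyed by entity ID.
demographics = {"u1": {"age": 34}, "u2": {"age": 27}}
spend = {"u1": {"avg_spend": 52.0}, "u2": {"avg_spend": 18.5}}

def join_features(*groups):
    # Merge feature groups on the shared entity key, as the offline
    # store would when assembling a training dataset.
    joined = {}
    for group in groups:
        for entity_id, feats in group.items():
            joined.setdefault(entity_id, {}).update(feats)
    return joined

rows = join_features(demographics, spend)
print(rows["u1"])  # {'age': 34, 'avg_spend': 52.0}
```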

3. Creation of Train/Test Datasets

Data scientists should be able to use the feature store to query, explore, transform, and join features to generate train/test datasets, ideally backed by a data versioning system.
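As a rough sketch, assembling versioned train/test splits from offline-store rows might look like the following. The field names are illustrative, and the content hash stands in for a real data versioning system:

```python
import hashlib
import json

# Rows queried and joined from the offline store (illustrative fields).
features = [
    {"entity_id": "u1", "event_year": 2017, "avg_spend": 40.0, "label": 1},
    {"entity_id": "u2", "event_year": 2019, "avg_spend": 12.0, "label": 0},
    {"entity_id": "u3", "event_year": 2020, "avg_spend": 75.0, "label": 1},
]

# Split by event time so the test set strictly follows the train set.
train = [r for r in features if 2010 <= r["event_year"] <= 2018]
test = [r for r in features if 2019 <= r["event_year"] <= 2020]

# A content hash acts as a simple dataset version identifier.
version = hashlib.sha256(
    json.dumps(train, sort_keys=True).encode()
).hexdigest()[:12]
print(len(train), len(test), version)
```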

4. Output Data Format

A feature store should provide the option for data scientists to output data in a format suitable for ML modeling (e.g., TFRecord for TensorFlow, NPY for NumPy-based frameworks such as PyTorch). Data warehouses/SQL currently lack this capability. This saves data scientists an additional step in their training pipeline.
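For instance, exporting queried features in the NPY format is a one-liner with NumPy. This sketch round-trips through an in-memory buffer; a real store would write to a file or object-storage path:

```python
import io
import numpy as np

# Feature rows queried from the store (illustrative values).
rows = [[34.0, 52.0], [27.0, 18.5]]

# Serialize as .npy, a format a NumPy/PyTorch training loop can load directly.
buf = io.BytesIO()
np.save(buf, np.asarray(rows, dtype=np.float32))

buf.seek(0)
restored = np.load(buf)
print(restored.shape)  # (2, 2)
```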

5. Temporal Operations

Automatic feature versioning at a defined time granularity (every second, hour, day, week, etc.) lets data scientists query features as they were at a given point in time. This is helpful in the following circumstances:

  1. creating train/test data (e.g., training data is from 2010–2018, while test data is from 2019–2020)
  2. making changes to a feature (e.g., rolling back a bad commit of a feature)
  3. comparing statistics for features to see how they change over time (feature monitoring)
  4. going back in time to fetch a feature value. For example, predictions made by models can be stored in the feature store, along with the outcomes (ground truth) for those predictions whenever they become available. If the actual ground-truth label for a prediction made 6 months earlier becomes available now, we first look up the feature values that were used to generate that prediction 6 months ago and then tag them with the new label, triggering model re-training if required.
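The primitive behind all four use cases is a point-in-time ("as of") lookup over a timestamped feature history. A minimal sketch, assuming the history is kept sorted by timestamp:

```python
from bisect import bisect_right

# (timestamp, value) pairs for one feature, sorted by timestamp.
history = [
    (100, 0.2),
    (200, 0.5),
    (300, 0.9),
]

def as_of(history, ts):
    # Return the feature value that was live at timestamp ts,
    # i.e., the latest value recorded at or before ts.
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, ts)
    if i == 0:
        raise KeyError("no value recorded at or before this timestamp")
    return history[i - 1][1]

print(as_of(history, 250))  # 0.5 -- the value that was live at ts=250
```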

6. Feature Visualization

A feature store should enable out-of-the-box visualization of feature data to show distributions, relationships between features, and aggregate statistics (min, max, avg, unique categories, missing values, etc.). This helps data scientists get quick insights into features and make better decisions about which ones to use.
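The aggregate statistics such a view would display can be computed in a few lines; this sketch covers one feature column with missing values:

```python
# One feature column as queried from the store, with a missing value.
values = [3.0, None, 7.5, 1.0, 7.5]
present = [v for v in values if v is not None]

# The headline statistics a feature store UI might show per feature.
stats = {
    "min": min(present),
    "max": max(present),
    "avg": round(sum(present) / len(present), 3),
    "unique": len(set(present)),
    "missing": len(values) - len(present),
}
print(stats)
```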

7. Feature Pipeline

A feature store should let users define feature pipelines with timed triggers that transform and aggregate input data from different sources before storing it in the feature store. Such feature pipelines (which typically run when new data arrives) can run at a different cadence from training pipelines.
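Stripped of the scheduler (cron, Airflow, etc., which handles the timed trigger), a feature pipeline is an ordered chain of transforms. A hedged sketch with illustrative steps:

```python
def parse(raw):
    # Step 1: transform raw ingested data into typed values.
    return [float(x) for x in raw.split(",")]

def aggregate(values):
    # Step 2: aggregate into the feature values to be stored.
    return {"sum": sum(values), "count": len(values)}

PIPELINE = [parse, aggregate]

def run_pipeline(raw):
    # Run each step on the previous step's output; the final result
    # would be written to the feature store.
    data = raw
    for step in PIPELINE:
        data = step(data)
    return data

features = run_pipeline("1.5,2.5,4.0")
print(features)  # {'sum': 8.0, 'count': 3}
```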

8. Feature Monitoring

A feature store should enable monitoring of features to identify feature drift or data drift. Statistics can be computed on live production data and compared with a previous version stored in the feature store to track training-serving skew. For example, if the average or standard deviation of a feature in production/live data differs considerably from the training data stored in the feature store, it may be prudent to re-train the models. Many of the concepts highlighted in our earlier article on model monitoring here can also be adapted to feature monitoring.
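The mean/standard-deviation comparison described above can be sketched as a simple drift check. The 25% relative tolerance is an illustrative choice, not a recommended threshold:

```python
import statistics

def drifted(train_values, live_values, rel_tolerance=0.25):
    # Flag drift when the live mean or standard deviation moves too far
    # from the training-time statistics stored in the feature store.
    t_mean, l_mean = statistics.mean(train_values), statistics.mean(live_values)
    t_std, l_std = statistics.pstdev(train_values), statistics.pstdev(live_values)
    mean_shift = abs(l_mean - t_mean) / (abs(t_mean) or 1.0)
    std_shift = abs(l_std - t_std) / (t_std or 1.0)
    return mean_shift > rel_tolerance or std_shift > rel_tolerance

train = [10.0, 11.0, 9.5, 10.5]          # training-time values
stable_live = [10.1, 10.9, 9.6, 10.4]    # live data, similar distribution
shifted_live = [20.0, 21.0, 19.5, 20.5]  # live data, mean has doubled

print(drifted(train, stable_live))   # False
print(drifted(train, shifted_live))  # True
```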

9. Feature Search/Recommendation

A feature store should index all features with their metadata and make it available for easy query retrieval by users (if possible, natural language-based). This will ensure features don’t get lost in a heap and promote their reusability. This is generally done by the Feature Registry (as mentioned in the earlier section).
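A toy sketch of such a registry query, matching a keyword against feature names, descriptions, and tags (a real registry would use a search engine or database full-text index; the entries here are illustrative):

```python
# Feature Registry entries: name -> searchable metadata.
registry = {
    "avg_order_value": {
        "tags": ["orders", "revenue"],
        "description": "30-day average order value",
    },
    "days_since_signup": {
        "tags": ["user", "tenure"],
        "description": "days since account creation",
    },
}

def search(registry, query):
    # Return feature names whose name, description, or tags match the query.
    query = query.lower()
    return sorted(
        name
        for name, meta in registry.items()
        if query in name.lower()
        or query in meta["description"].lower()
        or any(query in t for t in meta["tags"])
    )

print(search(registry, "order"))  # ['avg_order_value']
```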

Furthermore, feature stores can enhance the visibility of existing features by recommending possible features based on attributes and metadata of features already included in a project. This equips junior and mid-level data scientists with insights that otherwise come only with seniority and experience, enhancing the overall efficiency of the data science team.

10. Feature Governance

Without proper governance, a feature store can quickly become a feature lab or, worse, a feature swamp. The governance strategy impacts the workflows and decision-making of the various teams working with the feature store. Feature store governance includes aspects such as:

  1. Implementing access control to decide who gets access to work on which features
  2. Identifying feature ownership, i.e., assigning someone the responsibility for maintaining and updating features
  3. Limiting the feature type which can be used to train a particular type of model
  4. Improving the trust & confidence in feature data by auditing a model outcome to check for bias/ethics
  5. Maintaining transparency by utilizing feature lineage to track the source of a feature, how it was generated, and how it ultimately flows into downstream reporting and analytics within the business
  6. Understanding the internal policies, regulatory requirements, and external compliance requirements that apply to the feature data, and protecting it appropriately
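The access-control aspect (point 1 above) can be sketched as a minimal role-based check. The roles, actions, and feature names are hypothetical, and a real deployment would delegate this to the platform's identity and policy system:

```python
# Per-feature access-control list: feature -> action -> allowed roles.
ACL = {
    "churn_score": {
        "read": {"ds_team", "ml_platform"},
        "write": {"ml_platform"},
    },
}

def can(role, action, feature):
    # Deny by default: unknown features or actions grant no access.
    return role in ACL.get(feature, {}).get(action, set())

print(can("ds_team", "read", "churn_score"))   # True
print(can("ds_team", "write", "churn_score"))  # False
```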

Feature Store Usage in a Data Scientist’s Workflow

The following diagram maps the 3 major feature store components (red box) and the 10 major functionalities (blue boxes) mentioned above to their usage at various points in a typical data scientist's ML workflow.

Fig 4: Feature Store components & major functionalities across a DS Workflow (Image by Author)

Conclusion

Feature Stores make data scientists' work more convenient and efficient by abstracting much of the engineering required to acquire, transform, store, and use features in ML training and inference. They also enable rich data discovery and governance for the organization, thereby unlocking value in every byte of data it generates. Organizations deciding whether to buy, rent, or build a Feature Store should weigh the 3 major components and the 10 major functionalities discussed in this article to derive the most value from their implementation.

