Feature Store: Data Platform for Machine Learning

State-of-the-art open-source and homegrown feature stores that generate, manage and serve features at scale

Ning.Zhang
Towards Data Science


Feature data (or simply, features) are critical to the accurate predictions made by Machine Learning (ML) models. The feature store has recently emerged as an important component of the ML stack, and it generally enables the following tasks as part of the ML workflow:

  1. Automate feature computation, e.g. backfills and UDFs
  2. Manage feature metadata, e.g. lineage and versions
  3. Share and re-use features across different teams
  4. Serve or extract features offline, in real time, or on demand
  5. Monitor the full lifecycle of features from generation to serving

For a survey of state-of-the-art feature stores, https://www.featurestore.org/ consolidates and compares the major "feature store"-like systems. As noted there, many tech companies have built their own feature stores in-house to match their unique data architectures and business needs. For example, Uber's business depends on serving users at low latency, while for Airbnb, personalized recommendation is key to keeping travelers booking lodging on its platform.

Within such organizations, multiple ML data teams across different business units may be operating the same ML workflow independently, so it makes sense to consolidate those efforts into one feature store that standardizes and manages the full lifecycle of features and serves all teams.

Extending this idea to the broader ML community, it sounds promising to have a standard, out-of-the-box feature store for general ML use cases, possibly starting with simple, small-scale ones.

In the following, I will briefly survey the leading feature stores at two tech companies, Uber and Airbnb, as well as an open-source feature store, Feast. Finally, I will share my personal thoughts on a generic ML platform.

Airbnb: Zipline

Airbnb built their feature store, called Zipline, at least four years ago. The most recent talk is from Spark AI 2020, and here are my takeaways:

(1) Many features at Airbnb are generated from "sliding window" operations. See the following example feature (the average rating of a restaurant over the last 30 days):

rating_features = GroupBy(
    sources=EventSource(
        event_stream="restaurant_check_in_stream",
        event_log_table="core_data.restaurant_check_ins",
        query=Query(
            select=Select(
                restaurant="id_restaurant",
                rating="CAST(rating as Double) - 2.5",
            )
        )
    ),
    keys=["restaurant"],
    aggregations=Aggregations(
        avg_rating=Aggregation(
            documentation="Avg rating for this restaurant",
            operation=AVG,
            inputColumn="rating",
            windows=[Window(length=30, timeUnit=TimeUnit.DAYS)]
        )
    )
)

(2) To support sliding window operations efficiently, they proposed a property of an operator called "reversibility":

Reversible: (a + b) - a = b

Some operators have this property, for example SUM, average (AVG), and COUNT. When computing these over sliding windows, they do not need to recompute the whole window: because of reversibility, they simply drop what slides out of the window and add what slides in.
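As a minimal sketch of this incremental update for AVG (plain Python, not Zipline's implementation): each step adds the incoming value and subtracts the value that slid out, so the window is never re-summed.

```python
from collections import deque

def rolling_avg(stream, window):
    """Rolling average over a sliding window, using the 'reversibility'
    of SUM: add the new value, subtract the one that slid out, so each
    step costs O(1) instead of O(window)."""
    buf = deque()
    total = 0.0
    for value in stream:
        buf.append(value)
        total += value              # add what is new to the window
        if len(buf) > window:
            total -= buf.popleft()  # drop what slid out of the window
        yield total / len(buf)
```

For instance, `rolling_avg([1, 2, 3, 4, 5], 3)` yields the running 3-element averages 1.0, 1.5, 2.0, 3.0, 4.0.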

Sliding window example

Other operators, such as MIN and MAX, do not have the reversibility property. To compute them over sliding windows, a binary tree is built on the data within the window, so that as old data slides out and new data comes in, the tree is adjusted and its root always holds the answer (the MIN, MAX, etc.).

photo credit: Zipline presentation at https://www.youtube.com/watch?v=LjcKCm0G_OY

For example, 4 used to be the max in the window. When 4 slides out, the new root is chosen from 1, 3, and 2, so 3 becomes the new root.

By leveraging the tree structure, the time complexity is reduced from O(N²) to O(N log N), and the space complexity from 2N to N.
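Assuming a fixed window size, the tree-based approach can be sketched as a segment tree over a ring buffer (an illustration of the idea, not Zipline's code): each new value overwrites the leaf of the value that just slid out, the max is re-aggregated up the O(log W) path to the root, and the root is always the current window max.

```python
import math

class SlidingWindowMax:
    """Sliding-window max over a fixed-size window, via a segment tree
    built on a ring buffer: updating one leaf and its ancestors costs
    O(log W), and the root always holds the max of the window."""

    def __init__(self, window):
        # Pad the window up to a power of two so the tree is complete.
        self.size = 1 << math.ceil(math.log2(window))
        self.tree = [float("-inf")] * (2 * self.size)
        self.pos = 0  # ring-buffer slot of the element about to slide out

    def push(self, value):
        # Overwrite the leaf holding the element that just slid out.
        i = self.size + self.pos
        self.tree[i] = value
        # Re-aggregate max along the path to the root.
        i //= 2
        while i:
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2
        self.pos = (self.pos + 1) % self.size
        return self.tree[1]  # the root is the current window max
```

Replaying the example above with a window of 4: after pushing 4, 1, 3, 2 the root is 4; when 0 pushes 4 out, the root is re-chosen from 1, 3, 2, 0 and becomes 3.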

Uber: Michelangelo Palette

In its early days, Uber started building its feature store, called Michelangelo Palette. The most recent talk is here, and my takeaways are:

(1) 80% of the ML workload is feature engineering: for example, finding good features, serving features at scale, feature parity (training/serving skew), real-time features, and feature observability.

(2) The following abstractions make features organized, reusable, efficient:

  • Entity: An Uber business unit, e.g. Rider, Driver, Uber Eats
  • Feature group: A group of features commonly used together
  • Feature: ML ready data point
  • Join key: Key used to join features on, e.g. user_id, restaurant_id. This enables new features to be built on top of existing features.

(3) Three major types of features

  • Batch feature (via Spark): e.g. sum_orders_1week, avg_order_size_30day
  • Near Realtime feature (via Kafka, Flink): e.g. eyeball_impression_5_min
  • RPC feature (signal from external API, 3rd party): e.g. geohash
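As a toy illustration of how a join key lets features from different groups compose into a wider feature vector (the feature names are borrowed from the examples above; this is not Palette's actual API):

```python
# Two hypothetical feature groups keyed by the "driver_id" join key.
batch_features = {            # batch feature group (e.g. computed via Spark)
    1: {"sum_orders_1week": 42},
    2: {"sum_orders_1week": 17},
}
realtime_features = {         # near-real-time group (e.g. via Kafka/Flink)
    1: {"eyeball_impression_5_min": 7},
    2: {"eyeball_impression_5_min": 3},
}

def join_on_key(*groups):
    """Join feature groups on their shared key, producing one wide
    feature vector per entity present in every group."""
    keys = set.intersection(*(set(g) for g in groups))
    return {k: {name: v for g in groups for name, v in g[k].items()}
            for k in keys}

training_rows = join_on_key(batch_features, realtime_features)
```

Here `training_rows[1]` combines the batch and near-real-time features for driver 1 into a single row, which is exactly what the join key abstraction enables.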

(4) Feature quality monitoring: it is common to see feature pipeline breakages, missing feature data, drift, and inconsistency. The following approach has been implemented to tackle these problems:

photo credit: Uber Feature Engineering presentation at https://vimeo.com/477753622
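One simple check of this kind can be sketched as a z-score test on a feature's daily statistics (an assumed illustration of drift detection in general, not Uber's actual implementation):

```python
def detect_drift(history_mean, history_std, current_mean, threshold=3.0):
    """Flag a feature whose current mean is more than `threshold`
    standard deviations away from its historical mean (a z-score test).
    Thresholds and statistics here are illustrative choices."""
    z = abs(current_mean - history_mean) / max(history_std, 1e-9)
    return z > threshold
```

A monitoring job could run such a check per feature per day and alert when it fires, catching silent pipeline breakages before they skew model predictions.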

Feast: Open-source Feature Store

GoJek/Google released Feast in early 2019. It is built around Google Cloud services: BigQuery (offline), Bigtable (online), and Redis (low-latency), with Beam used for feature engineering.

photo credit: Feast architecture https://feast.dev/post/a-state-of-feast/

From the current Feast architecture, its focus is on feature storage, serving, and the registry. Here is a great article introducing what Feast is, what its current challenges are, and what comes next.

My Thoughts

On a generic ML data platform, here are my three personal thoughts:

(1) One of the most valuable and challenging problems is transforming raw data into high-quality, ML-friendly features. Tackling it depends heavily on the domain knowledge of ML engineers and is driven by business use cases. In other words, a standalone ML data platform is ineffective unless it serves a specific business scenario and use case, and works seamlessly with a team of good ML engineers.

(2) On the technology side, the ML data platform needs to support a variety of mainstream infrastructures underneath, whether open-source or commercial, operated in-house or in the cloud. As for the platform API, it has to support popular programming languages such as Python. For security, it must provide enterprise-level authorization and authentication to serve customers in highly regulated regions such as North America and Europe. For data privacy, it has to fully comply with local policies with zero compromise.

(3) On the scale side, as both external product users and internal users of the ML data platform grow rapidly, the platform should scale across business use cases and the technology stack. To name a few requirements:

  • provide a generic and flexible interface that lets ML engineers describe any type of feature easily and accurately
  • optimize the onboarding of thousands of new features on a daily basis
  • apply ever-changing privacy policies effectively and efficiently across diverse domains
  • manage heterogeneous infrastructures, adding newly emerging ones to the fleet and deprecating old ones smoothly.
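To make the first point concrete, a generic, declarative interface for describing features might look like the following sketch (every name here is invented for illustration, not any platform's real API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FeatureKind(Enum):
    """The three major feature types seen in Palette-style platforms."""
    BATCH = "batch"
    NEAR_REALTIME = "near_realtime"
    RPC = "rpc"

@dataclass
class FeatureSpec:
    """Hypothetical declarative description of a feature: enough for a
    platform to compute, register, and serve it without custom code."""
    name: str
    entity: str          # e.g. "driver"
    join_key: str        # e.g. "driver_id"
    kind: FeatureKind
    transform: str       # e.g. a SQL snippet or a UDF reference
    window_days: Optional[int] = None  # for windowed aggregations

# Declaring a batch feature from the examples earlier in the article.
weekly_orders = FeatureSpec(
    name="sum_orders_1week",
    entity="driver",
    join_key="driver_id",
    kind=FeatureKind.BATCH,
    transform="SUM(orders)",
    window_days=7,
)
```

A declarative spec like this is what lets the platform, rather than each team, own backfills, monitoring, and serving for every registered feature.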
