Features are not just for Data Scientists

A feature store origin story

Josh Berry
Towards Data Science


In the beginning

Once upon a time, in a land far, far away, I was one of the hundreds of Data Analysts working for a Fortune 50 company. As luck would have it, I was promoted to become one of the company’s first Data Scientists in 2014, alongside some fresh new hires who had PhDs and years of predictive modeling experience. I thought that I was totally out of my league.

Before too long, it became apparent that my SQL knowledge and business expertise were much more valuable than I had realized. A churn model is useless if you don’t know all the nuances in the data or how to write the SQL to properly avoid leakage in a customer-time data setup. Thus, I became a valued member of the team. The other data scientists and I had a symbiotic relationship: they taught me predictive modeling, and I wrote their SQL for them.

The glory days

And so, my SQL snippets became a script, and then my script became a team project. As a team, we started to build up a master SQL script that joined 20+ tables together and calculated informative metrics for each customer at any given point in time:

  • Number of days since customer changed packages
  • Number of times customer’s bill went up by more than $5 in last 365 days
  • Ratio of customer’s income to the average income of same ZIP3
  • Number of different customers in the same house over past 5 years
  • Flags for whether changes in product counted as Upgrade, Sidegrade, Downgrade

If you were a data scientist starting a new model, you were only expected to bring 3 columns: Customer, Date, and Target. The master script would give you about 150 features right off the bat. This was a huge time saver. It allowed the data scientists to spend more time coming up with new and creative features, which could then be added to the master script. As the de facto “owner” of the master script, I inherited all of that maintenance work. Nevertheless, life was great. Each data scientist was deploying models in 6 weeks like clockwork.
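
To make that workflow concrete, here is a minimal sketch of the pattern in SQL. The table and column names are invented for illustration: the data scientist supplies the customer, date, and target, and the master script (condensed here into a single precomputed feature table) supplies the rest.

    -- Hypothetical point-in-time join: the modeler brings model_population
    -- (customer_id, snapshot_date, target); the feature table supplies the rest.
    SELECT
        p.customer_id,
        p.snapshot_date,
        p.target,
        f.days_since_package_change,
        f.bill_increases_gt_5_last_365d,
        f.income_to_zip3_avg_ratio
        -- ...plus roughly 150 more precomputed columns
    FROM model_population AS p
    JOIN customer_features AS f
      ON f.customer_id = p.customer_id
     AND f.snapshot_date = p.snapshot_date;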

The pains of success

Then IT called. As our team grew to 10 data scientists, our load on the database infrastructure started costing big bucks. The master script we had accumulated was extremely expensive to run, and each data scientist had to run it 3 times for each model (training, plus two out-of-time validation samples, OOT1 and OOT2). In classic IT fashion, they didn’t help us solve the problem. They just limited our processing power by putting us in a separate resource queue, where we figuratively clashed in epic cage-fights over who had the right to run their query first. It was either that or give them 3 million dollars to expand the system. That wasn’t going to happen.

The alternative

So, we got in a room to brainstorm, and we realized how redundant it was to run the same query 70 times each month (that’s 10 x 3, plus the 40 deployed models for scoring). What if we ran this query once per month? Couldn’t we keep appending the results into a large table, ready to join by CustomerID and Snapshot Date? Thus, our first feature store was born, though the term didn’t exist yet.
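
In practice, the master script became a scheduled job that appended one snapshot per month. A heavily simplified sketch of that append, reduced to a single illustrative feature and written in a roughly Snowflake-style dialect (the names and functions here are assumptions, not our actual code), might look like this:

    -- Hypothetical monthly append job. customer_features is keyed by
    -- (customer_id, snapshot_date); the real script computed ~150 such
    -- columns from 20+ joined source tables.
    INSERT INTO customer_features (customer_id, snapshot_date, days_since_package_change)
    SELECT
        c.customer_id,
        DATE_TRUNC('month', CURRENT_DATE)                         AS snapshot_date,
        DATEDIFF('day', c.last_package_change_date, CURRENT_DATE) AS days_since_package_change
    FROM customers AS c;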

Back to glory days

Things were great again. Our models were getting better and better. We were up from 150 features to 400. We even dedicated 20% of our capacity to revisiting models we had already built, and we got even better results thanks to the new features. Smooth sailing… or so we thought.

If you build it, they will come

One day, like any other day, we realized that a payment metric was much more predictive if we removed the taxes from it. So we broke out the taxes into a different feature and carried on.

Then, my boss’s boss got a call from the SVP of Marketing. Who dares change the data without warning? Apparently, through word-of-mouth, some analysts from a different team had discovered our treasure trove of customer features. They had built Tableau dashboards off this data, and they were making regular reports from it.

Before I knew it, our precious internal table was subject to many restrictions. Marketing wanted a 3-month roadmap for new features so that they could have influence. They also wanted a robust data dictionary with lineage. We even discovered that Operations had started copying our table to their own database and were building their features off of our features. Dependencies were becoming confusing. After only a few months, our flexible modeling feature store was stuck in the mud of dependencies and bureaucracy.

I didn’t realize it at the time, but something significant was happening.

I’m a data scientist, leave me alone

I read on the internet that if you ever want to get under a data scientist’s skin, tell them to document everything. Our team was no different. Our first reaction was possessive: “This is our dataset. Let them make their own! This will impact our ability and speed to build models. Don’t they realize the millions of dollars the company could be losing?” By this point, I had been promoted to manager, and I was noticing a total lack of creativity around feature engineering. If adding features was going to be such a burden, then nobody was going to add them.

A different perspective

Another manager, one for whom I am extremely thankful and whom I shall never forget, pulled me aside and offered a perspective I hadn’t considered. He showed me that there was immense value in good data in general. The ability for hundreds of analysts around the company to get quick insights was worth vastly more than the impact of our models alone. Analysts were flocking to our data because there was a real need. Instead of fighting it, we had the chance to help the company in a big way and change its data culture.

And so, as a cross-functional team, we joined forces on the most successful data initiative I’ve ever seen at a company. Marketing had the great idea to give the feature store a catchy name, “Rosetta.” IT saw the benefits of efficiency and standardization and dedicated compute and monitoring tools to the effort. Executives also saw the benefits and provided top-down encouragement for their analysts to use it.

Fast forward a year, and we had the best of both worlds. Analytics engineers who were truly gifted at data processing owned the ETL and infrastructure. Data scientists owned the logic and requirements and eventually had 16,000 precomputed, ready-to-use features. The business units carefully selected a subset of 400 of those features for the analytical feature store for analysts (aka Rosetta).

The fact that the data was available in a single, standardized place launched the company into a golden age of analytics. Call center analysts discovered the nuances of how marketing activity played a huge role in achieving their KPIs. Marketing analysts discovered a wealth of insights into customers’ product usage. Finance discovered that fiber upgrades could be targeted toward specific neighborhoods to reduce churn. If it weren’t for the feature store, none of these teams would have had the expertise or time to bring these datasets together for analysis.

Lessons Learned

I have since left that company for greener pastures, but I carry those lessons with me. I spent the next few years consulting, where I realized how lucky I was to have that experience. Most recently, I joined a group of great folks at Rasgo, where we are actively developing a free product to help people follow the same journey I experienced. When I meet with leaders at other companies, the most common question I get asked is, what key lessons did you learn along the way?

It is a team effort

Large organizations are known for having silos. They’re also known for significant centralization efforts that fail miserably. The key to success is not swinging entirely one way or the other. I’m a firm believer in the centralization of technology, computational resources, and “Centers of Excellence,” but at the same time, we must also embrace the power of decentralized analytics groups, who are experts in their given fields. From the perspective of building a Feature Store, you need to figure out how to make centralized data engineering and decentralized analytics work together in a synergistic way.

Define Once

Here is the cold, hard truth: the logic of features and metrics needs to be centralized. If you’re nodding in agreement, you’ve probably experienced the disasters of decentralized analysts having to reinvent the wheel. If you’re eye-rolling in disagreement, then you’ve probably experienced the equally disastrous situation where defining and creating new metrics is bottlenecked by a centralized engineering process. Centralization of definitions is key, but you must use technology and business processes to create an efficient feedback loop that allows the decentralized analysts to contribute to the metric creation process. And by “efficient feedback loop,” I don’t mean: (1) put in a JIRA ticket, (2) wait 6 months for a new feature. If you’re doing that today, you need to reevaluate.

Compute Efficiently

Process as much inside the Modern Data Stack as possible. In my experience, we had so many tools available that it actually hurt us. Data was constantly moving from the warehouse to Alteryx, Tableau, SAS, Databricks, HDFS, S3, and anywhere else you can name. This dramatically increases IT costs, infrastructure requirements, and management complexity. Before you can scale, centralize all the main processing into one place. Only in special circumstances should you need to offload data to a different infrastructure for processing (streaming and IoT data, for instance).

Use Everywhere

At first, you might notice that a feature store sells itself. If you create the right metrics and features, then analysts will naturally start to use them. However, the secret to long-term success is avoiding the temptation to create silos and variations of those metrics. Again, the key is using technology and business processes to create an efficient feedback loop. You want analysts to be able to experiment, but you must provide a fast & easy way to push new metrics into the store.

Tactical advice for getting started

Consolidate

You should first pick a warehouse technology. There are many IT and end-user related concerns to account for when you make this decision. In large companies, you might have legitimate reasons to keep several data storage solutions. Still, the goal should be to consolidate as much as possible into a single warehouse that serves as the single source of truth for analysis.

The popular choice over the past few years has been a cloud warehouse like Snowflake. Sometimes, companies still require secondary systems for specialized data processing with Spark, but the end result should be storing the final output in the cloud warehouse. A good example of this is IoT sensor data: distributed storage and Spark are necessary to parse the data into meaningful tabular representations, which are subsequently loaded into the warehouse.

Hire great engineers

The driving force behind this entire initiative will be your engineering team. If I could start from scratch, I would recommend half internal hires who know where the skeletons are buried and half external experts who can bring experience and thought diversity.

The effort to consolidate computational load will impact the engineers the most. One of the things I see most often with a decentralized effort is an engineer building something simple, like dummy variables, as a pipeline that pulls data from Snowflake across the network into HDFS, loads it into memory via Spark, runs a few lines of Scala, writes the dataframe back to HDFS, and then loads the results back across the network into Snowflake. This is expensive and overly complex.
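
For contrast, here is a hedged sketch of the same dummy-variable idea pushed down into the warehouse as a single query. The column names and category values are invented for illustration; the point is that one-hot encoding rarely needs to leave the warehouse at all.

    -- Hypothetical one-hot (dummy variable) encoding done in place, with no
    -- network hops to HDFS/Spark and no extra cluster to babysit.
    SELECT
        customer_id,
        CASE WHEN product_tier = 'BASIC'   THEN 1 ELSE 0 END AS tier_basic,
        CASE WHEN product_tier = 'PREMIUM' THEN 1 ELSE 0 END AS tier_premium,
        CASE WHEN product_tier = 'ELITE'   THEN 1 ELSE 0 END AS tier_elite
    FROM customers;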

If you assemble a single team of engineering heroes, they will all understand the benefits of consolidating data processing onto the appropriate infrastructure and work together. And finally, make sure you listen to your engineers when they recommend technology that helps orchestrate their insanely complex world. I’ve noticed dbt is popular these days. It didn’t exist back then, but whatever the engineers say — trust them. We had an unfortunate mandate to use a technology called UC4 for pipeline automation, which I can safely say not a single engineer enjoyed using.

Don’t boil the ocean

You should adopt the Pareto principle and focus on the key data elements that represent the core of your business. Of course, overall success is determined by user adoption, so you might need to choose data elements that buy you some advocates as well.

In my example, our business was driven primarily by subscription revenue. Therefore, customer-related subscription data was at the top of the list because “churn” was the metric that everyone was focused on. However, the marketing department (with its hundreds of analysts) needed to be an ally in the testing and adoption of this new concept. Therefore, we partnered with them to find out which data sources and metrics were necessary for them. It turned out that third-party segmentation and audience scores were critical — so we included those as a must-have in phase 1.

The other thing to remember is that you can choose to start simple. For example, when we built our first customer feature store, one of the elements we added was a source that told us how many incoming phone calls the customer had made in the prior 30, 60, and 90 days. There were more complicated elements that would have required a lot more engineering: which queue did they land in? What was the duration of the call? How much time did they spend on hold? We knew we wanted those data elements eventually, but we started with something simple.
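
As a rough illustration of how simple that starting point can be, the call-count features might look something like the sketch below. The table and column names are hypothetical, and the date functions assume a Snowflake-style dialect.

    -- Hypothetical rolling call counts over the prior 30, 60, and 90 days,
    -- with none of the queue/duration/hold-time detail we added later.
    SELECT
        c.customer_id,
        COUNT(CASE WHEN ca.call_date >= DATEADD('day', -30, CURRENT_DATE) THEN 1 END) AS calls_last_30d,
        COUNT(CASE WHEN ca.call_date >= DATEADD('day', -60, CURRENT_DATE) THEN 1 END) AS calls_last_60d,
        COUNT(ca.call_id)                                                             AS calls_last_90d
    FROM customers AS c
    LEFT JOIN call_records AS ca
      ON ca.customer_id = c.customer_id
     AND ca.call_date >= DATEADD('day', -90, CURRENT_DATE)
    GROUP BY c.customer_id;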

Build all the pipes

Depending on your company and existing infrastructure, this can range from relatively simple to extremely complex. In my example, the engineers needed to build many extract-load pipelines from around 25 different sources, so they were accessible from within the data warehouse.

This is also a critical technology decision that should go hand-in-hand with the choice of warehouse. Airbyte, Fivetran, Integrate, and Segment are modern, cloud-native alternatives to more traditional solutions like Pentaho, SSIS, or Talend.

Define once

One of the keys to success is arriving at the proper definitions for your metrics and features. This is more difficult than it seems. For example, churn rate might sound simple, but what about customers who don’t pay their bills and are therefore disconnected automatically? Should you count those as churn? What about customers who disconnect because they died? Or moved into an area that your company doesn’t service? Perhaps what you really need is a Voluntary Churn metric and an Involuntary Churn metric combined into a Total Churn metric.
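
A hedged sketch of what that decomposition might look like in SQL, with invented table names and disconnect-reason codes:

    -- Hypothetical churn flags. Reasons like 'DECEASED' or a move out of the
    -- service footprint are deliberately excluded from both flags, so total
    -- churn is the union of the two rather than "any disconnect".
    SELECT
        customer_id,
        snapshot_date,
        CASE WHEN disconnect_reason IN ('CUSTOMER_REQUEST', 'COMPETITOR_SWITCH')
             THEN 1 ELSE 0 END AS voluntary_churn,
        CASE WHEN disconnect_reason = 'NON_PAYMENT'
             THEN 1 ELSE 0 END AS involuntary_churn,
        CASE WHEN disconnect_reason IN ('CUSTOMER_REQUEST', 'COMPETITOR_SWITCH', 'NON_PAYMENT')
             THEN 1 ELSE 0 END AS total_churn
    FROM customer_status;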

The point here is that the business needs to be involved. Obviously, the engineers own the pipeline and the processing code behind the scenes, but you need a way to let the analysts guide that process.

Democratize

The final step is to use the data everywhere. Sometimes, users will flock to the data on their own because the data speaks for itself. Other times, you will have to put forth a strong effort to track down reports, systems, and legacy “habits” that need transitioning. Furthermore, the effort to track down and migrate legacy processes will undoubtedly uncover additional needs that you didn’t know about. So challenge yourself with questions like: why can’t the quarterly summary for the Board of Directors be built off the new data?

The biggest reason projects like these eventually fail is because companies lose the democratization battle. I have seen it happen several times, and it is always a disaster due to the sunk cost of all the prior steps. Sometimes it is due to a lack of trust in the calculations (See Define Once, above). Sometimes it is due to poor technology decisions that introduce a skill gap (See Consolidate, above).

The other big reason for failure is the lack of a feedback loop. Unfortunately, this can take a while to manifest itself, so projects that succeed initially can fail months later. The truth is, businesses are always moving and shifting, so there needs to be a flexible process in place that controls future enhancements and developments.

For example, when we launched our first feature store, the company only sold three lines of business. Customers that had all three were identified as “Triple Play” customers. Before too long, we launched a fourth line of business, and “Quad Play” customers were born. Each existing metric definition needed careful consideration as to whether it had to be updated, and new metrics had to be created. If you don’t have an effective feedback loop, the analysts on the new product team will start reporting their own churn numbers out of necessity. Other lines of business might want to start making exclusions and exceptions, so they’ll start building silos of new metrics. This is where technology and business processes need to make it easy for analysts to contribute to the metric definition process.
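
To make that concrete, here is a hypothetical before-and-after of a single definition that had to change when the fourth line of business launched. The flags and table names are invented; the real change touched many metrics at once.

    -- The original bundling flag, plus the new one added through the feedback loop.
    SELECT
        customer_id,
        snapshot_date,
        CASE WHEN video_flag = 1 AND internet_flag = 1 AND phone_flag = 1
             THEN 1 ELSE 0 END AS triple_play,
        CASE WHEN video_flag = 1 AND internet_flag = 1 AND phone_flag = 1 AND mobile_flag = 1
             THEN 1 ELSE 0 END AS quad_play
    FROM customer_products;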

Conclusion

As a young data scientist, it was thrilling to develop a feature store before the term was ever popularized. However, I was short-sighted about how important the concept was for an enterprise-level shift into a data-driven culture. I’m so thankful for the visionaries I got to work alongside and that I got to take part in such a journey. I hope that sharing my story will be helpful and encouraging to those on a similar journey. Always feel free to reach out if you want to chat; you can usually find me in Slack. I’m particularly active in Locally Optimistic and DataTalks.Club, and you can always ping me directly at Rasgo.
