The Ultimate Tool for Data Science Feature Factories

Ben Epstein
Towards Data Science
5 min read · Aug 23, 2019


As Monte Zweben, co-founder and CEO of Splice Machine, wrote in an earlier post, How Data Science Silos Undermine Application Modernization, there are a number of important steps any team must take to avoid costly silos that can hinder the modernization journey. Two of the most important are building the right team for the right features and creating a culture of experimentation through feature factories. The former is quite straightforward: find the statistics wiz, the subject matter expert, and the SQL genius, and you’re well on your way to your data science dream team. The big challenge we see in many companies is the latter, the feature factory, especially with something I like to call feature organization. Before we begin, however, let’s define what a feature is: a feature in data science is a piece of information that your machine learning model can use to predict the outcome (the label). Think of a column in an Excel spreadsheet; the more columns you have, the more information your model has to base its predictions on.
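To make that concrete, here is a minimal sketch in pandas; the order table and its column names are hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical order-history table: every column except the last is a feature
orders = pd.DataFrame({
    "item_qty":          [3, 1, 7],        # feature
    "warehouse_stock":   [120, 4, 300],    # feature
    "ship_distance_km":  [50, 800, 120],   # feature
    "delivered_on_time": [1, 0, 1],        # label: the outcome the model predicts
})

X = orders.drop(columns=["delivered_on_time"])  # feature matrix
y = orders["delivered_on_time"]                 # label vector
```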

What Data Scientists Need: Feature Organization

Feature organization is the process of keeping every feature you’ve tested, every data set you’ve trained on, and every model you’ve deployed in an organized and cohesive workspace, with easy access for all data scientists and engineers on your team. Now, features are crucial. They will make or break your model; garbage in, garbage out. Finding and maximizing the signal of those features takes time, experimentation, and a culture of “fail fast and move on.” But the more you experiment and test and create, the harder it gets to remember exactly what you did and how you got there. And if something goes wrong in the pipeline, without that trail of breadcrumbs, it becomes nearly impossible to pinpoint the culprit. This is where organization becomes essential. Without quality feature organization, you’re doomed to a life of scattered features, random hyperparameters, and mismatched models.

What Data Scientists Have: Feature Overload

Let’s consider a typical day in the life of a data science team. A team of skilled data scientists is tasked with using the company’s data to build a model that accurately predicts which order items will be available and when they will be delivered to the customer, a problem known as ATP (Available to Promise). This is an important problem for many companies and, done right, can greatly reduce losses and increase customer satisfaction.

The team quickly takes action: the SQL expert begins organizing the disparate tables with complex joins, making the data easy to ingest and experiment on; the subject matter expert uses her deep knowledge of supply chain management to begin drafting valuable features that will bring in signal; and the data scientist begins reading research papers on the best models for the problem, experimenting with the features decided upon. The first PoC model is built; it’s not great, but the general idea is there. They discuss the results, tweak the features, remove some noisy ones, and try again. The model is a little better this time, but they can beat it; the data scientist switches to a neural network architecture. It’s getting better now: more features are added, the L1 penalty is decreased, L2 increased, two new features added, one removed. Accuracy decreases but F1 increases; that’s good, right? A hidden layer is added, then removed, then a few neurons are added. Do we even know if we’re converging yet? Things are getting complex very quickly.
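Here is a minimal sketch of that churn using scikit-learn on a synthetic stand-in dataset (the real ATP tables and features are assumptions here). Notice that each trial lives only in a throwaway list and a print statement, which is exactly the problem:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the ATP training data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each entry is one experiment the team tried; the list of tweaks grows
# far faster than anyone's memory of what produced which score
trials = [
    {"hidden_layer_sizes": (32,),    "alpha": 1e-4},
    {"hidden_layer_sizes": (32, 16), "alpha": 1e-3},  # layer added, L2 penalty up
    {"hidden_layer_sizes": (48,),    "alpha": 1e-3},  # layer removed, neurons added
]

for params in trials:
    model = MLPClassifier(max_iter=500, random_state=0, **params).fit(X_tr, y_tr)
    preds = model.predict(X_te)
    print(params, f"acc={accuracy_score(y_te, preds):.3f} f1={f1_score(y_te, preds):.3f}")
```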

What’s The Cost: Are We Really Doing the Best Data Science We Can?

While working as a data scientist, I created endless Excel spreadsheets trying to remember and record which subtle combination of features, standardization techniques, encoding schemes, and model hyperparameters I used for each experiment, and the various metrics each one delivered. It was exhausting and discouraging, and it pushed me to pick the first model that worked instead of meticulously analyzing my experiments for the best possible combination. On top of that, when it came time to retrain and redeploy a model, chances were I had forgotten or misplaced that spreadsheet and needed to start tweaking from square one.

I’ve seen some companies try to create their own solution to this problem, and although a proactive approach is admirable, rolling your own solution takes valuable time and resources away from your team that they could be using to actually do data science. Worse, this is not a one-time investment: any homegrown solution will require continuous maintenance and updates as new libraries are released and old ones improve.

The Start of the Feature Factory: MLflow

This is about the point in the development process where engineers begin looking to the incredible open source community. MLflow is one of my favorite open source projects in the data science community because it redefines feature organization and jump-starts the creation of your feature factory. It standardizes governance over your entire data science workflow, from overarching experiments to single-run trials to individual data scientists on your team. MLflow allows you to track your process as granularly (or as broadly) as you’d like and frees your data scientists to get back to doing what really matters. Every hyperparameter tweak, every feature change, every metric you can think of, all recorded in one organized location (more on this in the next article).
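Here’s a minimal sketch of what that tracking looks like with MLflow’s Python API; the model and synthetic data are placeholders, while the mlflow calls are the library’s actual tracking API:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real training data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

params = {"n_estimators": 100, "max_depth": 8}

# One tracked run: params, metrics, and the model artifact all land in one place
with mlflow.start_run(run_name="rf_baseline"):
    mlflow.log_params(params)
    model = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
    mlflow.log_metric("f1", f1_score(y_te, model.predict(X_te)))
    mlflow.sklearn.log_model(model, "model")  # saved for later deployment
```

Running `mlflow ui` afterwards lets the whole team browse every run, parameter, and metric side by side instead of hunting through spreadsheets.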

So you have your ideal model: every hyperparameter is tuned to perfection and you’ve squeezed the maximum signal from your features. Now what? MLflow makes deployment easy too. In a few lines of code, you can have your model running live on SageMaker or AzureML, no matter which library you chose, with easy APIs to get your predictions in real time. MLflow is the tool that keeps your team in sync, focuses them on the task at hand, and brings them right to the action.
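As a sketch of that deployment path, recent MLflow versions expose a generic deployment client; the endpoint name, run ID, role ARN, and region below are placeholders, and this assumes your AWS credentials and a SageMaker execution role are already set up:

```python
from mlflow.deployments import get_deploy_client

# Target the SageMaker deployment plugin that ships with MLflow
client = get_deploy_client("sagemaker")

# Deploy a previously logged model as a live REST endpoint
client.create_deployment(
    name="atp-model",                    # endpoint name (placeholder)
    model_uri="runs:/<run_id>/model",    # run ID from your tracking server
    config={
        "execution_role_arn": "arn:aws:iam::<account>:role/<sagemaker-role>",
        "region_name": "us-east-1",
        "instance_type": "ml.m5.large",
        "instance_count": 1,
    },
)
```

For a quick sanity check before touching the cloud, the `mlflow models serve -m runs:/<run_id>/model` CLI stands up the same REST scoring interface locally.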

At Splice Machine, we built a wrapper around MLflow called ML Manager that we use every day in our experimentation. It connects natively with our database and allows us to track and deploy the models we build right at the source. It has made our internal workflows much more organized, and our customers have really enjoyed it too. You can learn more about ML Manager and the Splice Machine platform, or register to view a webinar on how to operationalize machine learning.

Update: For a more in-depth and technical review of ML Manager and MLflow, with some code examples, check out my follow-up article.

