Five reasons feature engineering is key to maximizing data science impact

A framework for evaluating data products and building high-impact teams

Mark Derdzinski
Towards Data Science


Photo by Stephen Dawson on Unsplash

All organizations have data and need to curate that data into actionable features. In this essay, we’ll review what we mean by features and why they are integral to good business and effective data products. Then, we’ll discuss why data scientists are best suited to develop those features, how the lens of feature engineering can quantify the value of your data science projects, and how to use this knowledge to build high-impact data science teams.

All organizations need feature engineering.

The term “Data Science” is overloaded with everything from data munging and exploratory data analysis (EDA) to model testing, and the same questions plague organizations at every level of analytic maturity. Where does my organization need to leverage data science? Where is a data scientist better suited than an analyst, forecaster, or data engineer? How can I quantify the impact of data science projects?

(Image by author)

Data scientists are feature engineers.

Sometimes data scientists will indeed pre-process data to cull garbage and edge cases. They might get involved in exploratory data analysis as they map a problem space. They could deploy statistical or machine-learning techniques to forecast trends and predict future outcomes. Sometimes they’ll conduct experiments in search of causal impact. They might even get into the weeds of operationalization, deploying pipelines and “production” code to serve the fruits of their labor (but hopefully not, more on that later…). But ultimately, this is all in service of one core goal: providing their customers with actionable, curated data in the form of features.

Features are the foundation of all data products.

I’ll define a feature here as simply a transformation of data. It’s a black box: data goes in, and features emerge. The magic in between is the realm of data science — curating data and transforming it into actionable signals that can inform business decisions, automate processes, deliver customer insights, and drive a whole world of data products.

(Image by author)

By that definition, almost all pre-processing, population statistics, AI/ML, and data manipulation are forms of feature engineering. The goal, then, is to choose the most straightforward strategy with the most significant impact for a given use case.
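To make the black box concrete, here is a minimal sketch of a single feature in Python. The data, column names, and the as_of parameter are all hypothetical; the point is only the shape of the transformation: raw event logs go in, an actionable per-customer signal comes out.

```python
import pandas as pd

# Toy raw data: one row per customer interaction.
# The data and column names here are hypothetical.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "event_date": pd.to_datetime([
        "2023-01-02", "2023-03-15", "2023-02-01",
        "2023-02-20", "2023-03-30", "2023-01-10",
    ]),
})

def days_since_last_event(events: pd.DataFrame, as_of: str) -> pd.Series:
    """Feature: recency of activity per customer, a common churn signal."""
    last_seen = events.groupby("customer_id")["event_date"].max()
    return (pd.Timestamp(as_of) - last_seen).dt.days.rename("days_since_last_event")

# Data goes in, a feature comes out.
print(days_since_last_event(events, as_of="2023-04-01"))
```

Everything downstream of the raw logs, from this one-liner to a deep model producing a churn score, fits the same in/out contract.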

Features are granular.

Features are the atomic building blocks of data-informed decisions. Data assembled into a larger narrative can have a more significant impact (more on that another time), but framing work products in the language of features forces explainability into bite-sized chunks that stakeholders across the technical spectrum can digest. The arc of analytics maturity progresses as features become more nuanced, predictive, and prescriptive.

Data scientists are best suited for feature development.

Transmuting data into actionable features requires an artful blend of product sense and technical know-how. Impactful data science teams deliver this through deep domain knowledge, understanding an organization’s data assets (including edge cases and caveats), and mastery of statistical and ML/AI techniques.

A data scientist’s time is best spent testing hypotheses and improving their data products. Teams empowered with self-service platforms, which minimize time spent on orchestration and scaling, find more time for product improvement. Such platforms also elevate data scientists by giving them ownership of their handiwork. In the words of Jeff Magnusson [1], “Engineers Shouldn’t Write ETL.”

You can bound the value of features.

In framing product questions using the language of features, we can anticipate a project’s value by asking stakeholders, customers, or prospective users directly — what is the value of knowing this information? Data scientists can help stakeholders express that value in dollars by considering how the information could impact revenue, retention, lifetime value, or the cost to serve customers.

Consider, for example, a hypothetical business that wants to run a win-back program: it nets a return when customers react positively but generates no additional revenue when customers react negatively (in fact, it costs money to execute). If the success rate is low enough that the program’s expected value is negative, the business would rationally not reach out to any lapsed customers at all. If, however, it could divine which customers would respond positively before selecting who is eligible, the return on investment would be guaranteed. In this case, the value of knowing a customer’s likelihood to respond positively is the difference between the expected value of the best strategy without that information (not deploying the program at all, or zero) and the expected value with perfect information (deploying it only to the customers who respond positively).

This difference is known as the value of clairvoyance (VOC). It is a valuable tool for articulating and bounding the impact of data products in the planning or even ideation phase, without running a single experiment. Though you will rarely deliver this information with perfect accuracy or precision, estimating the VOC lets teams bound the value of descriptive features and the expected value (EV) of probabilistic features (e.g., likelihood to churn).
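For concreteness, here is a minimal sketch of the VOC calculation for the win-back scenario above. Every number (audience size, response rate, return, cost) is made up for illustration; only the structure of the calculation matters.

```python
# Value of clairvoyance (VOC) for the hypothetical win-back program.
# All numbers below are made up for illustration.

n_lapsed = 1_000        # lapsed customers eligible for the offer
p_positive = 0.10       # assumed rate of customers who respond positively
revenue_per_win = 50.0  # assumed net return from each positive response
cost_per_contact = 8.0  # assumed cost to execute the offer per customer

# Strategy A: contact everyone (no information about who will respond).
ev_contact_all = n_lapsed * (p_positive * revenue_per_win - cost_per_contact)

# Strategy B: contact no one; expected value is zero by definition.
ev_contact_none = 0.0

# The best the business can do without the feature.
ev_without_info = max(ev_contact_all, ev_contact_none)  # 0.0 here

# With perfect information, contact only the positive responders.
ev_with_info = n_lapsed * p_positive * (revenue_per_win - cost_per_contact)

voc = ev_with_info - ev_without_info
print(f"EV, contact all:       {ev_contact_all:>8,.0f}")  # -3,000
print(f"EV, perfect targeting: {ev_with_info:>8,.0f}")    #  4,200
print(f"Value of clairvoyance: {voc:>8,.0f}")             #  4,200
```

Because contacting everyone has a negative expected value, the best uninformed strategy is to do nothing, so the entire return under perfect targeting is attributable to the feature. A real model will capture only a fraction of this bound, which is exactly what makes VOC useful as an upper limit during planning.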

Conclusion

Features are an analytic team’s backbone: all organizations use curated data to drive business decisions or deliver value to their users. Thinking about data science and data product development through the lens of features can help you:

  • Frame the impact of data products in terms of the features. This framework can:
    – Increase the explainability and interpretability of data products through discrete feature descriptions.
    – Leverage the Value of Clairvoyance to bound the impact of projects quantitatively.
  • Give data scientists end-to-end ownership of the features they develop. Data scientists with deep domain knowledge can own the end impact of their data products and be accountable for maintaining performance through continual improvement (e.g., via experimentation).
  • Support end-to-end feature ownership with horizontal platforms that allow data scientists to focus on the science of feature engineering.

I’ve found this framework helpful in building impactful data science teams and planning new data products, and I hope you will too.

[1] J. Magnusson, Engineers Shouldn’t Write ETL (2016), https://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/

