The Most Important Metric for Machine Learning Models

Hint: it’s not F1-score, accuracy, precision, or recall

Wafiq Syed
Towards Data Science


How is your model creating value for your customers and making the business money? Business folks neither understand nor care about the ML metrics (F1-score, accuracy, precision) that data scientists work tirelessly to optimize. The true measure of a model’s success is how it impacts the business.

Scenario: You Develop a Strong Model That Does Nothing for the Business

Say you’re a data scientist working at your favourite tech company that sells a subscription-based product. It could be Netflix, Google Drive, Amazon Prime, Apple Music, etc. You’re tasked with predicting whether a customer will churn (cancel their subscription). Identifying customers who are likely to churn is important because we can target them in ways that encourage them to stay (e.g. incentives, discounts, product teasers). As an enthusiastic data scientist, you’re eager to experiment with various models.

Photo by Markus Winkler on Unsplash

You begin by spending time understanding the business case. The business prefers to retain as many customers as possible, even at the risk of providing incentives to false positives (customers incorrectly predicted as likely to churn). You decide to optimize for recall: the fraction of customers who actually churn that your model correctly identifies. After weeks of development, you launch your model into production. However, the churn rate does not decrease, and the business is disappointed. You’re disappointed because you worked tirelessly to achieve a stellar recall score. What went wrong?

There are usually two reasons why a strong model may not make a strong business impact. By strong model, I mean a model that has a good score (in this case, recall) and follows sound MLOps practices.
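For concreteness, here’s a minimal sketch of how that recall score is typically computed, using scikit-learn and made-up labels:

```python
# Minimal sketch: computing recall for a churn model.
# y_true / y_pred are hypothetical labels: 1 = churned, 0 = stayed.
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # what actually happened
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # what the model predicted

# Recall = true positives / (true positives + false negatives):
# the fraction of actual churners the model caught.
print(recall_score(y_true, y_pred))  # 3 of 4 churners caught -> 0.75
```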

Reason 1: Great Model, Poor Use Case

The use case or customer challenge that your team set out to solve is not important to your customers. This is a symptom of rushing to deliver features without ensuring there is actual demand for them. For example, imagine you run an e-commerce site that sells toys and a few gaming products. Even if you build an accurate model that predicts whether a customer is likely to buy gaming products, you may generate little revenue for the business. There just aren’t enough gaming products to create value.

The responsibility of ensuring that there is sufficient demand from your customers falls on your product manager.

A product manager does the groundwork to make sure that your model will be impactful

This can be an opportunity to collaborate with your PM. Together, you can test assumptions and learn. For example, before deploying an end-to-end model pipeline, your PM may work with you to define a Minimum Viable Product (MVP): the smallest effort that will test the assumptions behind the business use case. From a product standpoint, the MVP is NOT a Version 1.0, though it is often mistaken for one. I prefer the more descriptive term RAT, Riskiest Assumption Test (source: “Product Management’s Sacred Seven” by Neel, Parth, and Aditya).

In the gaming example, we might assume that sending coupons to customers will lead them to purchase a gaming product. We can test this assumption with the minimum level of effort that will provide convincing evidence for or against it. An example MVP might be a simple Naive Bayes model that spits out a list of customers to target. If the model is too much effort, grab a list of customers who browsed gaming products in the last week. And if that is still too much work, just grab a list of customers with the highest number of orders in the last month. Hold out 30% of the customers, target the other 70% with coupons, and observe the difference, as in the sketch below. If you find that the test group converted at a statistically significantly higher rate than the control group, you can proceed with building the model pipeline. If not, try to figure out why before doing any development.
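To make that readout concrete, here’s a minimal sketch of the significance check using a two-proportion z-test from statsmodels. All counts are made up for illustration:

```python
# Minimal sketch of the MVP readout, assuming hypothetical counts:
# 700 customers received coupons (test), 300 were held out (control).
from statsmodels.stats.proportion import proportions_ztest

conversions = [84, 21]   # gaming purchases: test vs. control
customers   = [700, 300]

# One-sided test: did the coupon group convert at a higher rate?
z_stat, p_value = proportions_ztest(conversions, customers, alternative="larger")

print(f"test rate:    {conversions[0] / customers[0]:.1%}")  # 12.0%
print(f"control rate: {conversions[1] / customers[1]:.1%}")  # 7.0%
print(f"p-value:      {p_value:.4f}")  # well below 0.05 here -> evidence the coupons help
```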

Reason 2: Great Model, Great Use Case, Poor Activation

Sometimes the use case and the model are great, but the team that consumes the predictions does not take the most effective actions. Back to our churn example: if the action the business takes is not strong enough to retain customers, the model will not have much impact. Your PM should make sure that the consumers of your model’s outputs act on them effectively. Experimenting with different approaches to keeping customers loyal can help overcome weak results.

Whether your model underperforms due to a poor use case or poor activation, it’s clear that in both cases your PM plays a critical role. So you may be wondering: as a data scientist, what should you do to make sure you’re focusing your efforts on the right tasks?

1) Identify the Business Metric

The most important metric for your model’s performance is the business metric that it is supposed to influence.

As I said at the outset, business folks neither understand nor care about the ML metrics (F1-score, accuracy, precision) that data scientists work tirelessly to optimize. Tell your business stakeholders that your model’s F1-score is 0.9 and they’ll have no idea what you’re talking about. Try to explain what an F1-score means and you might realize that you’re not entirely sure either. 😅
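(If you’re in that boat: F1 is simply the harmonic mean of precision and recall, as this tiny sketch with hypothetical scores shows. Knowing the formula still tells the business nothing about money.)

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.85, 0.95  # hypothetical scores

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.9 -- technically meaningful, commercially opaque
```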

Here’s the bottom line. If there’s one point you should take away from this post, it’s the following:

How is your model making the business money? The true measure of a model’s success is how it impacts the business.

Ultimately, business impact is the only result that matters. It is the fundamental difference between a Kaggle competition and modelling in the real world. In a Kaggle competition, the model with the highest score wins. In the real world, even if your model scores spectacularly on metrics such as precision and recall, it can still have poor business impact. If it makes no money, it’s useless to the business. Note that a model can make money for the business in one of three ways:

  1. Bring in more money (incremental revenue)
  2. Secure existing revenue
  3. Reduce costs

Before making any modelling efforts, speak with your PM about which business metric you are trying to move. For a churn model, customer churn rate is that metric: we want to decrease the number of customers leaving our business. It’s important to understand that reducing churn rate through the model and associated marketing tactics alone is difficult, because churn is influenced by an array of factors.

As Prinkesh correctly highlighted in the comments on this post, user experience is a strong determinant of churn.

The model helps win back customers who are on the verge of churning. Accordingly, we should measure the churn rate of the customers we target based on the model’s predictions, as in the sketch below. If we can reduce their churn rate, then the model is successful.
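Here’s a minimal sketch of what that measurement might look like, assuming we keep a holdout group among the customers the model flags. The DataFrame, column names, and outcomes are hypothetical:

```python
# Minimal sketch: one row per at-risk customer flagged by the model.
import pandas as pd

df = pd.DataFrame({
    "group":   ["targeted"] * 5 + ["holdout"] * 5,  # did we act on the prediction?
    "churned": [0, 0, 1, 0, 0,   1, 1, 0, 1, 0],    # outcome after the campaign
})

# Compare churn among flagged customers we targeted vs. those we held out.
churn_by_group = df.groupby("group")["churned"].mean()
print(churn_by_group)
# If "targeted" churns at a meaningfully lower rate than "holdout",
# the model plus the retention action is creating value.
```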

2) Understand Exactly How the Business Will Use Your Model

If you are not convinced that the business will use your model’s predictions properly, that’s a big red flag. Your PM is responsible for prioritizing meaningful work that they are confident will lead to impact. Raise your concerns with your PM and offer ideas on what would make you confident. PMs juggle several tasks at any given time, and it is difficult to have all the answers. Work as a team to make sure you have everything you need to unlock value.

Summary and Question

Focus on making an impact more than anything else. To be valuable, you need to deliver value and make sure the relevant business stakeholders know about it. If you can do that while also exploring cool technical frameworks, packages, models, and other tools, go for it. The sweet spot for product teams is the overlap between what you’re good at, what you like, and what the business needs.

Here’s a final question that should help you retain the contents of this article. In our churn example, if your first model has a recall score of 0.75 and, after experimenting, your second model has a recall score of 0.8, is the second one better? I’d love to hear your thoughts in the comments.
