Better Metrics ⇏ Happier Users

Designing a machine learning product to close the user-feedback loop

Alex Lisk
Towards Data Science



Imagine this “hypothetical” scenario: a new model was developed to replace the existing production model because the production model showed low accuracy on some “important” classes. The new model’s metrics were much better, so it was deployed as the replacement.

It turns out the new model actually made the user experience worse. Its metrics were better, but users didn’t feel any improvement. A post-mortem revealed why: although the overall metrics improved, the new model sacrificed accuracy on the classes users cared about most in order to improve classes they cared little about.

The initial assumption was that better metrics ⇒ better model, and of course better model ⇒ happier users. This assumption is critically flawed. Better metrics may imply a better model, but only a model that is better as judged by the metrics. A model that is better as judged by metrics does not imply happier users; only a model that is better as judged by the users does. While this may seem obvious, the who and the why of product development are often forgotten in the machine learning space.

This article will introduce the concept of a user-feedback loop as an essential component in any ML product design. We’ll discuss the drawbacks of common evaluation and monitoring methods, and how we can mitigate user dissatisfaction by implementing this concept in the machine learning development process.

Initial Definitions

I’ll start by defining the key terms used in this article.

Monitoring

“The goals of monitoring are to make sure that the model is served correctly, and that the performance of the model remains within acceptable limits.” [1]

User (or Customer) Feedback Loop

Note: I prefer the term “user feedback loop” over “customer feedback loop”, since “user” covers both customers and prospects. However, the two terms are often used interchangeably.

“A customer feedback loop is a customer experience strategy meant to constantly enhance and improve your product based on user reviews, opinions, and suggestions.” [2]

Feedback loops are important because, without user feedback, how would you expect an organization (whose primary objective is to sell to its customers) to get better at selling to those customers?

Traditionally, “closing the feedback loop” is:

“…targeted and personalized follow-up communication to specific pieces of product feedback. Closing the loop means letting your user know how you’ve improved the product because of what they said.” [3]

However, in the context of machine learning, it is better defined as:

Utilizing the user’s feedback on a model’s output to influence the priorities of model development.

This is distinct from the traditional “feedback loop” often described in ML, where the model’s output is used to re-train the model. That is mathematical feedback, not user feedback.

Evaluation & Monitoring


In this section, we’ll discuss the current approaches to evaluating a model, the metrics used to assess model degradation in the monitoring phase, and most importantly the problems with common approaches.

Resource vs. Performance Monitoring

Resource monitoring involves monitoring the infrastructure surrounding the model deployment. This is a traditional DevOps topic that we won’t discuss in this article, beyond noting the key questions it seeks to answer:

“Is the system alive? Are the CPU, RAM, network usage, and disk space as expected? Are requests being processed at the expected rate?” [4]

Performance monitoring involves monitoring the actual model.

“Key questions include: Is the model still an accurate representation of the pattern of new incoming data? Is it performing as well as it did during the design phase?” [4]

How to effectively answer these performance-monitoring questions is what we’ll be discussing in the rest of this article.

Ground Truth Metrics

A “ground truth” metric is a metric “…that is known to be real or true, provided by direct observation and measurement” [5]. In machine learning, it’s the expected, ideal result the model should produce. There are two types of ground truth metrics: real-time and delayed. We’ll also mention biased ground truth metrics and the absence of ground truth. For all examples described below, see [6] if you’d like more in-depth descriptions.

The ideal case is real-time ground truth. This is the case where “…ground truth is surfaced to you for every prediction and there is a direct link between predictions and ground truth, allowing you to directly analyze the performance of your model in production” [6]. A common example is digital advertising, where you receive near-instant feedback on whether the served ad was successful, based on the user’s behavior.

The more common case is delayed ground truth. As the name implies, this is the case where there is a large delay between the model’s output and learning what it should have predicted. A common example is fraud detection: we don’t know whether certain transactions were fraudulent until the cardholder reports them, which is often much later than the transaction date.
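In practice, handling delayed ground truth usually means joining a prediction log with a label feed that arrives later, and evaluating only the rows whose outcomes have been resolved. Here is a minimal sketch of that pattern; the tables, column names (transaction_id, predicted_fraud, is_fraud), and dates are hypothetical, not from any particular system.

```python
import pandas as pd

# Hypothetical prediction log written at serving time, and a label feed that
# arrives weeks later (e.g., chargeback reports).
predictions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "predicted_fraud": [0, 1, 0, 0],
    "scored_at": pd.to_datetime(["2024-01-01"] * 4),
})
labels = pd.DataFrame({
    "transaction_id": [1, 3],          # only some outcomes are known so far
    "is_fraud": [0, 1],
    "reported_at": pd.to_datetime(["2024-02-10", "2024-02-15"]),
})

# Left-join so every prediction is kept; rows without a label are still pending.
joined = predictions.merge(labels, on="transaction_id", how="left")
resolved = joined.dropna(subset=["is_fraud"])

accuracy = (resolved["predicted_fraud"] == resolved["is_fraud"]).mean()
coverage = len(resolved) / len(joined)
print(f"accuracy on resolved labels: {accuracy:.2f} (label coverage: {coverage:.0%})")
```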

A common problem with both real-time and delayed ground truth is bias. Take loan default prediction: we can only collect ground truth for the negative predictions (predicted not to default), because those applicants receive loans; we can’t collect any outcome for the positive predictions (predicted to default), since we denied them the loan.

Finally, we can encounter cases where no ground truth is available. We can often use proxy metrics in this case.

Proxy Metrics

If we’re dealing with delayed, absent, or biased ground truth, we often use proxy metrics in place of, or in addition to, ground truth metrics. A proxy metric is designed to be representative of the model’s performance without using ground truth. Proxy metrics “…give a more up to date indicator of how your model is performing” [6]. They also allow you to incorporate the importance of business outcomes into your metrics.

The most common and most widely used proxy metrics are data drift and concept drift. In theory, drift in the independent and/or dependent variables can indicate degrading model performance.
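As a rough illustration of what a drift check looks like in practice, the sketch below compares a feature’s reference (training-time) distribution against a recent production window using a two-sample Kolmogorov-Smirnov test; the data is synthetic and the alerting threshold is an arbitrary assumption.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature values: the training (reference) window vs. the most
# recent production window, where the mean has shifted.
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
production = rng.normal(loc=0.3, scale=1.0, size=5_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the two samples
# were not drawn from the same distribution, i.e. possible data drift.
statistic, p_value = ks_2samp(reference, production)
if p_value < 0.01:
    print(f"possible data drift (KS={statistic:.3f}, p={p_value:.1e})")
else:
    print("no significant drift detected")
```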

Problems

There is abundant (and usually bad) advice on how to monitor production models, but it’s hard to find advice that considers the actual users. Most of it is rooted in over-reliance on poorly constructed proxy metrics. Here’s the issue: proxy metrics are not perfect. They’re a stand-in for performance, not a direct measurement of it. Problems are introduced when this distinction isn’t understood.

The main culprit is drift, in what [7] calls the “drift-centric” view of ML monitoring, where drift is assumed to be a perfect indicator of model performance. Like all proxy metrics, drift is not perfect, and relying on it alone is not an effective monitoring strategy.

An example that illustrates this point is the use of synthetic data for training object detection models. Studies have shown that up to 70% of real-world training data can be replaced with synthetic data without sacrificing model performance. We’d expect the distribution of synthetic data to be wildly different from real-world data, yet this shift does not hurt performance.

This is not to say that drift should never be used. Drift should be used in monitoring “…if you have reason to believe a particular feature will drift and cause your model performance to degrade” [7]. However, it shouldn’t be used as the only metric.

In summary,

the problems that arise from the use of proxy metrics in common monitoring approaches stem from the disconnect between model evaluation and user feedback.

For a proxy metric to be effective, to truly represent model performance, and to measure what matters, it must be constructed with a user-centered view.

User-Centered Design


Definition

“User-centred design (UCD) is a collection of processes which focus on putting users at the center of product design and development. You develop your digital product taking into account your user’s requirements, objectives and feedback.

In other words, user-centered design is about designing and developing a product from the perspective of how it will be understood and used by your user rather than making users adapt their behaviours to use a product.” [8]

UCD + ML

While the traditional definition of UCD fits well into product design, how does this apply to model evaluation and monitoring?

Two of the major UCD principles defined by [8] are:

  • Early and active involvement of the user to evaluate the design of the product.
  • Incorporating user feedback to define requirements and design.

These concepts seem familiar. Remember user-feedback loops? We’ll now discuss how to implement user-feedback loops in the model evaluation and monitoring phases.

Introducing the User Feedback Loop

Marketing “User Feedback Loop” (Image by Author)

Above is the flow of a traditional marketing “user feedback loop”. The key actions of the loop, starting from the users, are:

  • Ask: Involving your users and asking for feedback about your product. Common sources of feedback come directly from users, such as interviews and surveys. Indirect feedback can also be valuable, from teams such as customer success and sales.
  • Centralize: “Turning feedback into action is difficult when it’s buried in a folder or scattered across various inconsistent spreadsheets” [3]. Feedback should consistently be gathered and centralized in a “Feedback Lake”. This usually takes the form of a centralized data sharing solution, such as a centralized Google Drive folder for all spreadsheets, interviews, etc., but it can also be as simple as a #feedback Slack channel. The feedback lake will be very unorganized and will contain quite a lot of noise. Don’t worry: that’s the point. We want to break down any barriers that prevent anyone at the organization from sharing feedback they received from users. We’ll deal with this problem in the next step.
  • Label & Aggregate: To extract actionable insights, feedback must be sorted in some comprehensible way. Feedback should be labeled with “…a short description, one or more feature or product categories it falls under, and names or counts of the requestors” [9]. Each labeled item is then entered into the Feedback “System of Record” (SOR) - the consolidated source of truth for user feedback (a minimal sketch of such a record follows this list). The SOR could be as simple as a spreadsheet, or as complex as a JIRA board. Regardless, it should allow easy aggregation by feedback type and frequency. “The goal here is to create a highly systematized process such that as new feedback comes in across the various input sources, it is quickly and efficiently processed into the system of record” [9].
  • Prioritize: The SOR can now be used to aggregate and identify pain points for users. However, not all feedback is created equal: “The key aspect to remember when incorporating feedback into your product roadmap process is that the way to go about doing this is never simply taking the most frequently requested features and putting them at the top of your roadmap” [9]. We should assess user feedback as a component in the product roadmap planning, along with other business goals or strategic priorities.
  • Implement & Communicate: Of course, actually implementing the product roadmap is important. More importantly, though, close the loop by communicating to your users that their feedback has been heard and has been (or will be) implemented.
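To make the “Label & Aggregate” step concrete, here is a minimal sketch of what one record in a feedback system of record might look like, and how records could be aggregated by category; the fields and example entries are illustrative assumptions, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    """One labeled entry in the feedback system of record (all fields assumed)."""
    description: str   # short summary of the feedback
    category: str      # product or feature area it falls under
    source: str        # interview, survey, sales call, #feedback channel, ...
    requestors: int    # how many users raised it

records = [
    FeedbackRecord("Misclassifies invoices as receipts", "document-types", "survey", 14),
    FeedbackRecord("Predictions feel slow on mobile", "latency", "support", 5),
    FeedbackRecord("Invoice class often wrong", "document-types", "sales", 9),
]

# Aggregate by category so the most-reported pain points surface first.
by_category = Counter()
for r in records:
    by_category[r.category] += r.requestors
print(by_category.most_common())
```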

With this foundation, the question still remains: how do we apply user feedback loops to machine learning products?

We’ll start with a simplified data science process:

Data Science Process (Image by Author)

I’ll assume readers are familiar with this process. If not, you can check out my previous article (the section “Data Science Process” explains this diagram in-depth).

We can see that the Data Science Process exhibits the previously discussed problem: it’s disconnected from users and their feedback. If we merge the previous two diagrams, we arrive at the following process diagram:

Customer Feedback Loop for Data Science Process (Image by Author)

We can now see that user feedback is connected to the model development process. User feedback directly influences the ML roadmap, driving future development efforts. Here’s a walk-through of the diagram (numbers correspond to the number on the diagram):

  1. The process starts from business or strategic priorities driving the ML roadmap.
  2. The roadmap defines the initial goal that kicks off the data science process.
  3. The data science process produces a served model (the product) which is then monitored.
  4. The served model is then “communicated” to users (i.e. in production).
  5. The feedback loop begins: Ask, Centralize, Label & Aggregate, Prioritize. The result is prioritized user feedback about your model (product) injected into the ML roadmap.
  6. The user feedback can cause two things to happen: (a) the feedback triggers maintenance for the existing model (i.e. users are not happy with model performance) or (b) you define a new goal based on your users’ feedback. Either way, the loop is closed by kicking off the data science process again.

We can also see red dashed arrows in the diagram, originating from the users. These indicate the important indirect influences the users have on the data science process. Following the concepts from UCD, not only must the user’s feedback be utilized, but the user must also be involved in the design process. This is arguably more important than user feedback. If your user isn’t considered until the feedback phase, your model will be useless. We’ll describe this in more detail below.

User-Centered Metrics

The most important metric to judge a model by is actual user needs and feedback. Ideally, the user’s wants and needs are determined early in the process and incorporated into the evaluation metrics. This is illustrated in the diagram as the dashed red arrow from “Users” to “Model Evaluation”.

If ground-truth metrics are available, the appropriate metrics must be selected based on user needs. For example, in email spam detection, we may have determined that our users don’t mind if a couple of spam emails reach their inbox, but they really care if non-spam emails are classified as spam. In this case, the ground-truth metric we care most about is precision, and we care less about recall. Using F1 instead (as an example) would not reflect our users’ needs, since it weights precision and recall equally. This case seems obvious, but it gets more complex when dealing with proxy metrics.
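As a small illustration (with made-up labels), the sketch below shows how precision surfaces the failure these users care about, a legitimate email flagged as spam, while recall stays perfect and F1 barely moves.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for a batch of emails: 1 = spam, 0 = not spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]  # one legitimate email flagged as spam

# Precision of the spam class penalizes exactly what these users hate most:
# legitimate mail landing in the spam folder (a false positive).
print("precision:", precision_score(y_true, y_pred))  # 0.80
print("recall:   ", recall_score(y_true, y_pred))     # 1.00
print("f1:       ", f1_score(y_true, y_pred))         # ~0.89
```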

If we need to use proxy metrics, we must construct a metric centered on our users’ needs. Constructing proxy metrics depends heavily on the problem domain, as they usually require domain-specific knowledge. Generally, a proxy metric tries to quantify the user-driven business problem. This is usually a safe assumption, as performing well on the business problem generally means the model is performing well.

As an example, let’s return to the loan default prediction problem discussed earlier. We know the ground-truth metric is biased, so we want to develop a proxy metric to quantify model performance. Suppose a business goal is to reduce the number of people we deny a loan. A simple proxy metric would then be the percentage of applicants denied a loan. While this is an over-simplistic toy example, it illustrates the thinking process.
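To make this concrete, here is a toy sketch of tracking that proxy; the scores, decision threshold, and target denial rate are all made-up assumptions for illustration.

```python
import numpy as np

# Hypothetical default-risk scores from the production model over a review period.
scores = np.array([0.12, 0.85, 0.40, 0.07, 0.66, 0.93, 0.21, 0.55])
THRESHOLD = 0.5                      # applicants scoring above this are denied

# Proxy metric: share of applicants denied a loan. Ground truth for denied
# applicants never arrives, but this business-level rate is always observable.
denial_rate = (scores >= THRESHOLD).mean()
print(f"denial rate: {denial_rate:.0%}")

# Track it against a target agreed with the business and alert on large deviations.
TARGET_DENIAL_RATE = 0.30            # assumed business target
if denial_rate > TARGET_DENIAL_RATE * 1.2:
    print("denial rate is >20% above target -- investigate the model")
```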

User-Influenced Monitoring

This topic ties back into user-centered metrics. We generally monitor how a model’s evaluation metrics change over time. By choosing those metrics appropriately, we’ll signal degrading model performance before it begins to affect our users, rather than when an arbitrary quantity like KL-divergence (drift detection) exceeds a pre-defined threshold. If we don’t choose our metrics according to user needs, detection of a degrading model may occur:

  • Too early and too frequently, causing alert fatigue. It’s been said that “…alert fatigue is one of the main reasons ML monitoring solutions lose their effectiveness” [7].
  • Too late, affecting our user’s experience before we even realize it.

It’s important to note that our users should define the segments we monitor across. A great illustration of why:

“If you’re familiar with monitoring web applications, you know that we care about metrics like the 99th percentile for latency not because we worry about what happens to a user once per hundred queries, but rather because for some users that might be the latency they always experience” [7].

This also applies to model predictions: certain characteristics of some users may cause the model to be less accurate for them than for others. Going back to loan default prediction, the model might be much less accurate for applicants in a certain location, for example. This is definitely not the behavior we want.

To prevent this, it’s important to monitor metrics across user segments or cohorts that are important to the business, and signal when any cohort displays degrading performance.
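Here is a minimal sketch of what per-cohort monitoring might look like, assuming a prediction log that records a business-relevant segment (a hypothetical region column) and a correctness flag; the alert threshold is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical prediction log with a user segment the business cares about.
log = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "south", "west", "west"],
    "correct": [1,        1,       0,       0,       1,       1,      1],
})

# Overall accuracy hides the struggling cohort; per-segment accuracy surfaces it.
overall = log["correct"].mean()
by_region = log.groupby("region")["correct"].mean()

ALERT_THRESHOLD = 0.70   # assumed minimum acceptable accuracy per cohort
print(f"overall accuracy: {overall:.2f}")
for region, acc in by_region.items():
    flag = "  <-- degrading" if acc < ALERT_THRESHOLD else ""
    print(f"{region}: {acc:.2f}{flag}")
```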

User-Centered Deployment

It’s important to consider users when deploying a new model, too, and not just to prevent user annoyance: we made assumptions when aggregating and prioritizing feedback and translating it into a new business objective. We must validate those assumptions by ensuring the expected positive results are reflected in user behavior.

Common user-centered model deployment strategies include:

  • Shadow Testing (Silent Deployment): The new model is deployed alongside the old one. It scores the same requests, but its predictions are never served to the user. This allows the new model to be evaluated against the user-centered metrics in the production environment. An obvious drawback is that no user feedback is generated, so we’re relying on metrics alone.
  • A/B Testing (Canary Deployment): The new model is deployed and served to a small number of users. This approach minimizes affected users in the case of worse performance, while also allowing the collection of user feedback. However, a drawback is it’s less likely to catch rare errors in the new model.
  • Multi-Armed Bandits (MABs): This approach can be viewed as “dynamic A/B testing”. A MAB balances exploration (routing some traffic to each model to learn which performs better) with exploitation (serving the model that currently looks best). Eventually, the MAB algorithm converges on the ideal solution, serving all users with the best-performing model (see the sketch below). The main drawback is that this approach is the most complex to implement.
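For intuition only, here is a minimal epsilon-greedy sketch of the MAB idea, routing requests between an “old” and a “new” model and updating each model’s average reward from a user-centered signal. The reward simulation and the epsilon value are assumptions, and production systems typically use more robust algorithms (e.g. Thompson sampling).

```python
import random

# Minimal epsilon-greedy bandit routing traffic between the old and new model.
# The "reward" is any user-centered signal (a click, a thumbs-up, a completed task).
models = ["old", "new"]                      # placeholders for real model handles
counts = {name: 0 for name in models}
reward_sums = {name: 0.0 for name in models}
EPSILON = 0.1                                # fraction of traffic used to explore

def choose_model() -> str:
    """Explore occasionally; otherwise exploit the best-looking model so far."""
    if random.random() < EPSILON:
        return random.choice(models)
    # Unseen arms get +inf so each is tried at least once before exploitation.
    return max(models, key=lambda m: reward_sums[m] / counts[m] if counts[m] else float("inf"))

def record_feedback(name: str, reward: float) -> None:
    counts[name] += 1
    reward_sums[name] += reward

# Usage: for each request, pick a model, serve its prediction, then log the
# user-centered signal. Rewards here are simulated for illustration only.
for _ in range(1000):
    picked = choose_model()
    simulated_reward = float(random.random() < (0.55 if picked == "new" else 0.50))
    record_feedback(picked, simulated_reward)

print({m: round(reward_sums[m] / max(counts[m], 1), 3) for m in models})
```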

Beware of Bias


As with most data, bias exists, and biased data creates biased models. Understanding and mitigating bias is therefore very important in the machine learning development process. The typical result of a biased model is that certain segments of the user base are served disproportionately worse than others. We previously discussed monitoring across cohorts of your user base to surface this problem, but that doesn’t help when the metrics themselves are the source of the bias.

If there is bias in ground truth data, this will lead to any ground-truth metrics also being biased. There are two solutions here: eliminate the bias, or use a proxy metric instead.

However, proxy metrics can also introduce bias if they are not constructed mindfully. A Deloitte whitepaper states that “…bias finds its way through proxy variables into a machine learning system” [10], giving an example of protected features in a mortgage-worthiness predictor. While features like age, race, and sex are protected by regulation, features such as postal code, dwelling type, and loan purpose “…do not directly represent a protected characteristic, but do correlate highly with a certain protected characteristic” [10]. So even if we exclude all protected characteristics as features, we can still unintentionally introduce bias if we choose a proxy metric that uses a correlated feature.
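One lightweight sanity check, sketched below, is to measure how strongly a candidate proxy feature is associated with a protected attribute before building features or proxy metrics on it. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical applicant data: postal_code is a candidate proxy feature and
# `protected` stands in for a protected characteristic we must not encode.
df = pd.DataFrame({
    "postal_code": ["A1", "A1", "A1", "B2", "B2", "C3", "C3", "C3"],
    "protected":   [1,    1,    1,    0,    0,    0,    1,    0],
})

# A quick association check: if knowing the postal code tells you a lot about
# the protected attribute, any feature or proxy metric built on it can smuggle
# bias back in even though the protected attribute itself was excluded.
rates = df.groupby("postal_code")["protected"].mean()
print(rates)  # very uneven rates across postal codes suggest a strong correlation
```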

Conclusion

We set the foundation by covering current approaches to evaluating a model, the types of ground truth metrics, proxy metrics, and the problems with common monitoring approaches. Then, pivoting to user-centered design, we introduced the user feedback loop and how to apply UCD to the evaluation, monitoring, and deployment phases. We finished by discussing the dangers of introducing bias with these methods.

I hope this article gives a more sustainable, user-centric look at model development, and a starting place on how to incorporate these principles into your own machine learning products.

This is just the beginning! If you enjoyed the article, follow me to get notified of the next ones! I appreciate the support ❤️

Sources

[1] A. Burkov, Machine Learning Engineering (2020), Québec, Canada: True Positive Inc.

[2] D. Pickell, How to Create a Customer Feedback Loop That Works (2022), Help Scout

[3] H. McCloskey, 7 Best Practices for Closing the Customer Feedback Loop, UserVoice

[4] M. Treveil & the Dataiku Team, Introducing MLOps (2020), O’Reilly Media, Inc.

[5] Ground truth (2022), Wikipedia

[6] A. Dhinakaran, The Playbook to Monitor Your Model’s Performance in Production (2021), Towards Data Science

[7] J. Tobin, You’re probably monitoring your models wrong (2022), Gantry

[8] User Centered Design, Interaction Design Foundation

[9] S. Rekhi, Designing Your Product’s Continuous Feedback Loop (2016), Medium

[10] D. Thogmartin et al., Striving for fairness in AI models (2022), Deloitte

[11] P. Saha, MLOps: Model Monitoring 101 (2020), Towards Data Science

[12] K. Sandburg, Feedback Loops (2018), Medium

[13] D. Newman, How Well Does Your Organization Use Feedback Loops? (2016), Forbes
