Machine Learning Monitoring: What It Is, and What We Are Missing

Is there life after deployment?

Elena Samuylova
Towards Data Science


Chart: model awesomeness over time
Image by author.

Is there life after deployment?

Congratulations! Your machine learning model is now live. Many models never make it that far; some claim that as many as 87% are never deployed. Given how hard it is to get from a concept to a working application, the celebration is well deserved.

It might feel like a final step.

Indeed, even the design of machine learning courses and the landscape of machine learning tools add to this perception. They extensively address data preparation, iterative model building, and (most recently) the deployment phase. Still, both in tutorials and practice, what happens after the model goes into production is often left up to chance.

Model lifecycle: data preparation, feature engineering, model training, model evaluation, model deployment — ?
Image by author.

The simple reason for this neglect is a lack of maturity.

Aside from a few tech giants that live and breathe machine learning, most industries are only getting started. There is limited experience with real-life machine learning applications. Companies are overwhelmed, sorting many things out for the first time and rushing to deploy. Data scientists do everything from data cleaning to A/B test setup. Model operations, maintenance, and support are often only an afterthought.

One of the critical, but often overlooked components of this machine learning afterlife is monitoring.

Why monitoring matters

An ounce of prevention is worth a pound of cure — Benjamin Franklin

With the learning techniques we use these days, a model is never final. In training, it studies the past examples. Once released into the wild, it works with new data: this can be user clickstream, product sales, or credit applications. With time, this data deviates from what the model has seen in training. Sooner or later, even the most accurate and carefully tested solution starts to degrade.
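To make this concrete, here is a minimal sketch of how such deviation might be caught in practice: compare the distribution of each numeric feature in recent production data against the training sample with a two-sample Kolmogorov-Smirnov test. The dataframes, column selection, and the 0.05 significance threshold are illustrative assumptions, not a prescription.

```python
# A minimal sketch of detecting feature drift between training and live data.
import pandas as pd
from scipy.stats import ks_2samp

def check_feature_drift(reference: pd.DataFrame, current: pd.DataFrame,
                        alpha: float = 0.05) -> dict:
    """Run a two-sample Kolmogorov-Smirnov test per numeric column."""
    drifted = {}
    for col in reference.select_dtypes("number").columns:
        stat, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        if p_value < alpha:  # distributions differ more than chance would suggest
            drifted[col] = round(p_value, 4)
    return drifted

# Usage: compare last week's production inputs against the training sample.
# drifted = check_feature_drift(train_df, last_week_df)
# if drifted:
#     send_alert(f"Drift detected in: {list(drifted)}")  # send_alert() is a placeholder
```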

The recent pandemic illustrated this all too vividly.

Some cases even made the headlines:

  • Instacart’s model for predicting item availability in stores dropped from 93% to 61% accuracy due to a drastic shift in shopping habits.
  • Bankers question whether credit models trained in good times can adapt to stress scenarios.
  • Trading algorithms misfired in response to market volatility. Some funds fell by 21%.
  • Image classification models had to learn the new normal: a family at home in front of laptops can now mean “work,” not “leisure.”
  • Even weather forecasts are less accurate since valuable data disappeared with the reduction of commercial flights.
Home office with children playing on the background
A new concept of “office work” your image classification model might need to learn in 2020.
(Image by Ketut Subiyanto on Pexels)

On top of this, all sorts of issues occur with live data.

There are input errors and database outages. Data pipelines break. User demographic changes. If a model receives wrong or unusual input, it will make an unreliable prediction. Or many, many of those.
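A hedged sketch of the kind of basic input validation that can catch such issues before they reach the model; the expected columns, ranges, and the 10% missing-value threshold are made-up placeholders you would derive from your own training data profile.

```python
# A minimal sketch of sanity checks on incoming data, under assumed column
# names and value ranges.
import pandas as pd

EXPECTED_COLUMNS = {"age", "income", "country"}        # assumption
VALID_RANGES = {"age": (18, 100), "income": (0, 1e7)}  # assumption

def validate_batch(batch: pd.DataFrame) -> list:
    """Return a list of human-readable issues found in a batch of live data."""
    issues = []
    missing = EXPECTED_COLUMNS - set(batch.columns)
    if missing:
        issues.append(f"missing columns: {missing}")
    for col, (low, high) in VALID_RANGES.items():
        if col in batch and not batch[col].between(low, high).all():
            issues.append(f"{col} outside expected range [{low}, {high}]")
    null_share = batch.isna().mean()
    for col, share in null_share[null_share > 0.1].items():  # 10% cutoff: assumption
        issues.append(f"{col} has {share:.0%} missing values")
    return issues
```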

Model failures and untreated decay cause damage.

Sometimes this is just a minor inconvenience, like a silly product recommendation or a wrongly labeled photo. The effects go much further in high-stakes domains, such as hiring, grading, or credit decisions. Even in otherwise “low-risk” areas like marketing or supply chain, underperforming models can severely hit the bottom line when they operate at scale. Companies waste money on the wrong advertising channels, display incorrect prices, understock items, or harm the user experience.

Here comes monitoring.

We don’t just deploy our models once. We already know that they will break and degrade. To operate them successfully, we need a real-time view of their performance. Do they work as expected? What is causing the change? Is it time to intervene?

This sort of visibility is not a nice-to-have, but a critical part of the loop. Monitoring is baked into the model development lifecycle, connecting production back to modeling. If we detect a quality drop, we can trigger retraining or step back into the research phase to rework the model.

Model lifecycle: after model deployment comes serving and performance monitoring.
Image by author.
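As a rough illustration of closing that loop, the sketch below checks a quality metric on the latest labeled window against an acceptable level and reacts when it drops. The helpers load_recent_labeled_data(), retrain_model(), and notify() are hypothetical hooks you would wire into your own pipeline, and the 0.85 threshold is an arbitrary placeholder.

```python
# A hedged sketch of a retraining trigger; acceptable quality is defined by
# the use case, not by a universal number.
from sklearn.metrics import accuracy_score

ACCEPTABLE_ACCURACY = 0.85  # use-case specific, illustrative only

def monitoring_step(model, load_recent_labeled_data, retrain_model, notify):
    """Check the latest labeled window and react if quality has degraded."""
    features, labels = load_recent_labeled_data()
    accuracy = accuracy_score(labels, model.predict(features))
    if accuracy < ACCEPTABLE_ACCURACY:
        notify(f"Accuracy dropped to {accuracy:.2f} (threshold {ACCEPTABLE_ACCURACY})")
        retrain_model()  # or step back into research to rework the model
    return accuracy
```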

Let us propose a formal definition:

Machine learning monitoring is the practice of tracking and analyzing production model performance to ensure acceptable quality as defined by the use case. It provides early warnings on performance issues and helps diagnose their root cause so they can be debugged and resolved.

How machine learning monitoring is different

One might think: we have been deploying software for ages, and monitoring is nothing new. Just do the same with your machine learning stuff. Why all the fuss?

There is some truth to it. A deployed model is a software service, and we need to track the usual health metrics such as latency, memory utilization, and uptime. But in addition to that, a machine learning system has its unique issues to look after.

Monitoring Iceberg. Above water: service health. Below water: data and model health.
Image by author.

First of all, data adds an extra layer of complexity.

It is not just the code we should worry about, but also data quality and its dependencies. More moving pieces — more potential failure modes! Often, these data sources reside completely out of our control. And even if the pipelines are perfectly maintained, environmental change creeps in and leads to a performance drop.

Is the world changing too fast? In machine learning monitoring, this abstract question becomes applied. We watch out for data shifts and routinely quantify the degree of change. Quite a different task from, say, checking a server load.
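One common way to put a number on “how much did the data change” is the Population Stability Index (PSI), sketched below for a single feature. The binning scheme is an assumption, and the rule-of-thumb thresholds in the comment are conventions often cited in practice rather than a strict standard.

```python
# A minimal PSI sketch: bin edges come from the reference (training) data;
# values in the current window that fall outside that range are ignored here.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_share = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_share = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid division by zero / log of zero for empty bins.
    ref_share = np.clip(ref_share, 1e-6, None)
    cur_share = np.clip(cur_share, 1e-6, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

# Rule-of-thumb interpretation often used in practice (not a strict standard):
# PSI < 0.1 — no significant change; 0.1–0.25 — moderate shift; > 0.25 — major shift.
```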

To make things worse, models often fail silently.

There are no “bad gateways” or “404”s. Even if the input data is odd, the system will likely return a response. An individual prediction might seemingly make sense, while being harmful, biased, or wrong.

Imagine we rely on machine learning to predict customer churn, and the model falls short. It might take weeks to learn the facts (such as whether an at-risk client eventually left) or to notice the impact on a business KPI (such as a drop in quarterly renewals). Only then would we suspect the system needs a health check! You would hardly miss a software outage for that long. In the land of unmonitored models, this invisible downtime is an alarming norm.

To save the day, you have to react early. This means assessing just the data that went in and how the model responded: a peculiar type of half-blind monitoring.

Input Data — Model response — You are here — Ground Truth — Business KPI
Image by author.
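In code, such half-blind monitoring might look like the sketch below: with no labels yet, we only compare the model’s output scores on a current window against a reference window. The specific statistics and thresholds are illustrative assumptions for a binary classifier.

```python
# A hedged sketch of "half-blind" monitoring: no ground truth yet, only inputs
# and the model's responses to watch.
import numpy as np

def response_health(scores_reference: np.ndarray, scores_current: np.ndarray) -> dict:
    """Compare model outputs (e.g., predicted probabilities) against a reference window."""
    return {
        # Has the average prediction shifted noticeably?
        "mean_shift": float(abs(scores_current.mean() - scores_reference.mean())),
        # Is the model suddenly unsure, crowding around the decision boundary?
        "share_near_boundary": float(np.mean(np.abs(scores_current - 0.5) < 0.1)),
        # Is it predicting the positive class far more or less often than before?
        "positive_rate": float(np.mean(scores_current > 0.5)),
    }
```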

The distinction between “good” and “bad” is not clear-cut.

One accidental outlier does not mean the model went rogue and needs an urgent update. At the same time, stable accuracy can also be misleading. Hiding behind an aggregate number, a model can quietly fail on some critical data region.

One model with 99% accuracy: doing great! Another model with 99% accuracy: a complete disaster!
Image by author.
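A minimal sketch of looking past the aggregate: compute the same metric per segment and surface the weakest ones. The column names here (segment, target, prediction) are assumptions for illustration.

```python
# Segment-level quality check: the overall number may look fine while one
# critical group quietly underperforms.
import pandas as pd
from sklearn.metrics import accuracy_score

def accuracy_by_segment(df: pd.DataFrame,
                        segment_col: str = "segment",
                        target_col: str = "target",
                        pred_col: str = "prediction") -> pd.Series:
    return (
        df.groupby(segment_col)
          .apply(lambda g: accuracy_score(g[target_col], g[pred_col]))
          .sort_values()
    )

# The aggregate accuracy may read 0.99 while accuracy_by_segment() reveals
# a critical region — say, new users or a premium tier — sitting at 0.60.
```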

Metrics are useless without context.

Acceptable performance, model risks, and costs of errors vary across use cases. In lending models, we care about fair outcomes. In fraud detection, we can barely tolerate false negatives. With stock replenishment, ordering too much might be better than ordering too little. In marketing models, we would want to keep tabs on performance in the premium segment.

All these nuances inform our monitoring needs, specific metrics to keep an eye on, and the way we’ll interpret them.
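As a hedged illustration, a monitoring setup might encode this context explicitly, for example as a per-use-case configuration like the one below. Every metric name and threshold here is a made-up placeholder, not a recommendation.

```python
# Illustrative only: each model gets its own priority metrics, alert rules,
# and segments to watch, reflecting its risks and costs of errors.
MONITORING_CONFIG = {
    "credit_scoring": {
        "primary_metrics": ["roc_auc", "approval_rate_by_group"],  # fairness matters
        "alert_if": {"roc_auc": ("<", 0.75)},
    },
    "fraud_detection": {
        "primary_metrics": ["recall", "precision"],  # false negatives are costly
        "alert_if": {"recall": ("<", 0.90)},
    },
    "demand_forecasting": {
        "primary_metrics": ["mean_absolute_error", "forecast_bias"],  # understocking hurts
        "alert_if": {"forecast_bias": ("<", -0.05)},
    },
    "marketing_response": {
        "primary_metrics": ["precision_at_k"],
        "watch_segments": ["premium"],  # keep tabs on the premium segment
    },
}
```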

With this, machine learning monitoring falls somewhere between traditional software monitoring and product analytics. We still look at “technical” performance metrics — accuracy, mean absolute error, and so on. But what we primarily aim to check is the quality of the decision-making that machine learning enables: whether it is satisfactory, unbiased, and serves our business goal.

In a nutshell

Looking only at software metrics is too little. Looking at the downstream product or business KPIs is too late. Machine learning monitoring is a distinct domain, and it requires appropriate practices, strategies, and tools.

Who should care about machine learning monitoring?

The short answer: everyone who cares about the model’s impact on business.

Of course, data scientists are on the front line. But once the model leaves the lab, it becomes a part of the company’s products or processes. Now, this is not just some technical artifact but an actual service with its users and stakeholders.

The model can present outputs to external customers, such as a recommendation system on an e-commerce site. Or it can be a purely internal tool, such as sales forecasting models for your demand planners. In any case, there is a business owner — a product manager or a line-of-business team — that relies on it to deliver results. And a handful of other concerned parties, with roles spanning from data engineering to support.

Both data and business teams need to track and interpret model behavior.

Questions from different roles. Data scientist: why is my model drifting? Data science manager: is it time to retrain? Etc.
Image by author.

For the data team, this is about efficiency and impact. You want your models to make the right call, and the business to adopt them. You also want the maintenance to be hassle-free. With adequate monitoring, you can quickly detect, resolve, and prevent incidents, and refresh the model as needed. Observability tools help keep the house in order and free up your time to build new things.

For business and domain experts, it ultimately boils down to trust. When you act on model predictions, you need a reason to believe they are right. You might want to explore specific outcomes or get a general sense of the model’s weak spots. You also need clarity on the ongoing model value and peace of mind that risks are under control.

If you operate in healthcare, insurance, or finance, this supervision gets formal. Compliance will scrutinize the models for bias and vulnerabilities. And since models are dynamic, it is not a one-and-done sort of test. You have to continuously run checks on the live data to see how each model keeps up.

We need a complete view of the production model.

Proper monitoring can provide this and serve each party the right metrics and visualizations.

Different visualizations for data scientists, product managers, business teams, and model users.
Image by author.

Let’s face it. Enterprise adoption can be a struggle. And it often only starts after model deployment. There are reasons for that.

In an ideal world, you can translate all your business objectives into an optimization problem and reach the model accuracy that makes human intervention obsolete.

In practice, you often get a hybrid system and a bunch of other criteria to deal with. These include stability, ethics, fairness, explainability, user experience, and performance on edge cases. You can’t simply blend them all into your error-minimization goal. They need ongoing oversight.

A useful model is a model used.

Fantastic sandbox accuracy makes no difference if the production system never takes off.

Beyond “quick win” pilot projects, one has to make the value real. For that, you need transparency, stakeholder engagement, and the right collaboration tools.

The visibility pays back.

This shared context improves adoption. It also helps when things go off track.

Suppose a model returns a “weird” response. It is the domain experts who help you decide whether you can dismiss it. Or your model fails on a specific population. Together, you can brainstorm new features to address this.

Want to dig into the emerging data drift? Adjust the classifier decision threshold? Figure out how to compensate for model flaws by tweaking product features?

All this requires collaboration.

Such engagement is only possible when the whole team has access to relevant insights. A model should not be an obscure black-box system. Instead, you treat it as a machine learning product that one can audit and supervise in action.

When done right, model monitoring is more than just technical bug-tracking. It serves the needs of many teams and helps them collaborate on model support and risk mitigation.

The monitoring gap

In reality, there is a painful mismatch. Research shows that companies monitor only one-third of their models. As for the rest? We seem to be in the dark.

This is how the story often unfolds.

At first, a data scientist babysits the model. Immediately after deployment, you often need to collect feedback and iterate on details, which keeps you occupied. Then the model is deemed fully operational, and its creator moves on to a new project. The monitoring duty is left hanging in the air.

Some teams routinely revisit their models for a basic health check but miss anything that happens in between. Others only discover issues from their users and then rush to put out a fire.

The solutions are custom and partial.

For the most important models, you might find a dedicated home-grown dashboard. Often, these become a Frankenstein of custom checks bolted on after each consecutive failure the team encounters. To complete the picture, each model monitor has its own custom interface, while business KPIs live in separate, siloed reports.

If someone on a business team asks for a deeper model insight, this would mean custom scripts and time-consuming analytical work. Or often, the request is simply written off.

It is hard to imagine critical software that relies on spot-checking and manual review. But these disjointed, piecemeal solutions are surprisingly common in the modern data science world.

Why is it so?

One reason is the lack of clear responsibility for the deployed models. In a traditional enterprise setting, you have a DevOps team that takes care of any new software. With machine learning, this is a grey zone.

Sure, IT can watch over service health. But when the input data changes — whose turf is it? Some aspects concern data engineering, while others are closer to operations or product teams.

Everybody’s business is nobody’s business.

The data science team usually takes up the monitoring burden. But they juggle way too many things already and rarely have incentives to put maintenance first.

In the end, we often drop the ball.

To-do list of a data scientist: design A/B test, clean up data, define new problem, check on my models — 30 days late.
A day in the life of an enterprise data scientist. Image by author.

Keep an eye on AI

We should urgently address this gap with production-focused tools and practices.

As applications grow in number, holistic model monitoring becomes critical. You can hand-hold one model, but not a dozen.

It is also vital to keep the team accountable. We deploy machine learning to deliver business value, and we need a way to show it clearly in production, as well as to raise awareness of the cost of downtime and the importance of support and improvement work.

Of course, the data science process is chaotic all over.

We log experiments poorly. We mismanage deployments. Machine learning operations (aka MLOps) is a rising practice that tackles this mess step by step. And monitoring sits, sort of, at the very end. Yet we’d argue that we should solve it early, ideally as soon as your first model gets shipped.

When a senior leader asks you how the AI project is doing, you don’t want to take a day to respond. Nor do you want to be the last to know about a model failure.

Seamless production, visible gains, and happy users are key to building the reputation machine learning needs to scale. Unless you are in pure research, that is where we should aim.

Summing up

Monitoring might be boring but is essential to success.
Do it well, and do it sooner.

This blog was originally published at https://evidentlyai.com.

Machine learning monitoring is exactly what we aim to solve at Evidently AI. Check out our open-source tools on GitHub!
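For example, generating a data drift report with the Evidently Python library can take just a few lines. Note that the library’s interface has changed across releases; this sketch assumes a version that exposes the Report and DataDriftPreset API, so check the current docs before copying it.

```python
# A minimal example of building a data drift report with Evidently,
# assuming a release with the Report / DataDriftPreset interface.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_csv("training_sample.csv")      # data the model was trained on
current = pd.read_csv("last_week_production.csv")   # recent production inputs

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")  # share the report with the team
```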

Want to stay in the loop? Sign up for our updates and product news, follow us on Twitter and Linkedin for more content on production machine learning, or join our Discord community to chat and connect.
