The definitive guide to AI monitoring

Learning from our work creating production visibility for teams across deep learning and machine learning use cases

Yotam Oren
Towards Data Science


This article was co-authored with Itai Bar Sinai.

AI teams across verticals agree emphatically that their data and models must be monitored in production. Yet many teams struggle to define exactly what to monitor: specifically, what data to collect at inference time, which metrics to track, and how to analyze those metrics.

The sheer variety and complexity of AI systems mean that a “one size fits all” approach to monitoring does not work. Nevertheless, we are here to provide some clarity and discuss universally applicable approaches.

Having worked with a multitude of teams across verticals (spanning both deep learning and machine learning), we keep hearing a few consistent motivations, including:

  • A desire to resolve issues much faster
  • A strong need to move from “reactive” to “proactive”, that is, to detect data and model issues well before business KPIs are negatively impacted or customers complain

So, how should you track and analyze your AI?

1. Define model performance metrics

Attaining objective measures of success for production AI requires labels or “ground truth” for your inference data. A few cases in which this would be possible include:

  • A human-in-the-loop mechanism, with annotators, customers or third parties labeling at least a sample of the inference data. For example, a fraud detection system that receives lists of actual fraudulent transactions (after the fact).
  • Business KPIs could provide a sort of “labeling”. For example, for a search or recommendation model, you could track the clicks or conversions (tied back to each inference).

The latter, by the way, could lead to the holy grail of monitoring — being able to assess precisely the impact (positive or negative) of the models on the business outcomes.

The availability of labels enables calculating and analyzing common model validation metrics, such as false positive/negative rates, error/loss functions, AUC/ROC, precision/recall and so on.

It is important to note that the labels mentioned above are generally not available at inference time. The “ground truth” feedback may arrive seconds after the model runs (e.g., a user clicking on a recommended ad) or weeks later (e.g., a merchant notifying the fraud system about confirmed fraudulent transactions). Consequently, an AI monitoring system should support updating labels (and other types of data) asynchronously.
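As a minimal sketch of what this could look like (the prediction log, its column names such as inference_id, and the toy values are all hypothetical), delayed labels can be joined back onto logged predictions before computing standard validation metrics:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical inference log: one row per model run, labels unknown at serving time.
predictions = pd.DataFrame({
    "inference_id": [1, 2, 3, 4],
    "score": [0.91, 0.12, 0.55, 0.78],   # raw fraud probability
    "predicted": [1, 0, 1, 1],           # thresholded decision
})

# Ground truth that arrives days or weeks later (e.g., confirmed fraud reports).
delayed_labels = pd.DataFrame({
    "inference_id": [1, 2, 3, 4],
    "label": [1, 0, 0, 1],
})

# Join labels back onto the predictions asynchronously, then compute metrics
# only over the rows for which ground truth is already available.
labeled = predictions.merge(delayed_labels, on="inference_id", how="inner")

print("precision:", precision_score(labeled["label"], labeled["predicted"]))
print("recall:   ", recall_score(labeled["label"], labeled["predicted"]))
print("AUC:      ", roc_auc_score(labeled["label"], labeled["score"]))
```

The key design choice is that the label column is populated by a later, separate process than the one that logs the predictions, so metrics are always computed over whatever subset has ground truth so far.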

A note about monitoring annotators

Needless to say, labeled data is only as good as the labeling process and the individual annotators doing the labeling. Forward-thinking AI teams leverage monitoring capabilities to assess their annotation process and annotators. How would you do that? One example would be to track the average delta between what your model is saying and what your annotators are saying. If this metric rises above a certain threshold, one can assume that either the model is grossly underperforming or the annotators are getting it wrong.
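For illustration only (the scores, labels, and threshold value below are made up), the model-vs-annotator delta described above might be tracked roughly like this:

```python
import numpy as np

# Hypothetical sample of inference data that was also sent to human annotators.
model_scores = np.array([0.92, 0.15, 0.60, 0.88, 0.05])  # model probabilities
annotator_labels = np.array([1, 0, 1, 0, 0])              # human labels (0/1)

# Average absolute delta between the model's score and the annotator's label.
avg_delta = np.abs(model_scores - annotator_labels).mean()

# If the disagreement rises above a threshold, either the model is badly
# underperforming or the annotation process itself needs review.
DISAGREEMENT_THRESHOLD = 0.3  # illustrative value, tune per use case
if avg_delta > DISAGREEMENT_THRESHOLD:
    print(f"Alert: model/annotator disagreement {avg_delta:.2f} exceeds threshold")
```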

“… the holy grail of monitoring — being able to assess precisely the impact (positive or negative) of the models on the business outcomes.”

2. Establish granular behavioral metrics of model outputs

Tracking model outputs is a must.

From one angle, output behavior can surface problems that are hard to detect by looking elsewhere (e.g., a highly sensitive model can be “thrown off” by an input change too small to notice on its own). From another angle, there could be significant changes in input features that barely affect output behavior. Therefore, metrics based on outputs are priority number one within the monitoring scope.

Below are a few examples of metrics created from model outputs:

  • Basic statistical analysis of raw scores, e.g., weekly average and standard deviation of fraud probability score
  • Confidence score/interval, e.g.,
  1. The distance from a decision boundary (e.g., from the hyperplane in SVM models, or when using a simple threshold)
  2. The delta between the chosen class and the second place in a multi-class classification model
  • In classification models, the distribution of the chosen classes
  • The rate of non-classifications (i.e., when none of your classes’ scores passed your threshold)
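Here is a rough sketch of how a few of these output-based metrics could be computed from a logged score matrix (the class names, threshold, weekly bucketing, and randomly generated scores are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical log of multi-class scores: one row per inference, one column per class.
classes = ["approve", "review", "decline"]
scores = pd.DataFrame(rng.dirichlet(np.ones(3), size=1000), columns=classes)
scores["week"] = rng.integers(1, 5, size=1000)  # illustrative weekly bucket

THRESHOLD = 0.5  # minimal score required to emit a classification

# 1. Basic statistics of the raw scores, per week.
weekly_stats = scores.groupby("week")[classes].agg(["mean", "std"])

# 2. Confidence margin: delta between the chosen class and the runner-up.
sorted_scores = np.sort(scores[classes].values, axis=1)
margin = sorted_scores[:, -1] - sorted_scores[:, -2]

# 3. Distribution of the chosen classes.
chosen = scores[classes].idxmax(axis=1)
class_distribution = chosen.value_counts(normalize=True)

# 4. Rate of non-classifications (no class passed the threshold).
non_classification_rate = (scores[classes].max(axis=1) < THRESHOLD).mean()

print(weekly_stats.round(3))
print("mean margin:", round(float(margin.mean()), 3))
print(class_distribution.round(3))
print("non-classification rate:", round(float(non_classification_rate), 3))
```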

Overall, anomalies in output-based metrics tell the team that something is happening. To understand why, and whether and how to resolve it, the team should include features and metadata in the monitoring scope. More on this below.

3. Track feature behavior individually and as a set

Tracking feature behavior serves two purposes:

  1. To explain changes that were detected in output behavior
  2. To detect issues in upstream stages (e.g., data ingestion and prep)

When issues are detected in output behavior, features might be called upon to explain why. Explaining the issue in this case may require a feature importance analysis, using one of a host of prevalent methods such as SHAP or LIME.
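As a hedged illustration of the SHAP route (the data, feature names, and random forest model below are made up; the array layout returned by shap has changed across versions, which the code accounts for), mean absolute SHAP values can rank which features most drive the model’s outputs:

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical tabular data; feature names are illustrative only.
X = pd.DataFrame(rng.normal(size=(500, 3)),
                 columns=["amount", "account_age", "num_devices"])
y = (X["amount"] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree-based models.
explainer = shap.TreeExplainer(model)
raw = explainer.shap_values(X)

# Depending on the shap version, the result may be a list of per-class arrays,
# a 3-D array (samples, features, classes), or a single 2-D array.
if isinstance(raw, list):
    vals = raw[1]               # contributions toward the positive class
elif raw.ndim == 3:
    vals = raw[:, :, 1]
else:
    vals = raw

# Rank features by mean absolute contribution.
importance = pd.Series(np.abs(vals).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))
```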

Separately, tracking changes in feature behavior is an independent way to detect issues without looking at outputs. So, which upstream events may manifest as anomalous feature behavior? There are too many to count. A few examples include:

  • Changes in the business, such as an influx of new customers
  • Changes in external data sources (e.g., new browsers, new devices)
  • Changes introduced in preceding pipelines, e.g., a bug in the new release of the data ingestion code

For the reasons mentioned above, collecting and analyzing feature behavior in production is a critical part of the monitoring scope.
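One common way to flag anomalous feature behavior, sketched here under the assumption that a training-time baseline sample is available, is a simple per-feature two-sample test such as Kolmogorov-Smirnov (the feature names, distributions, and p-value threshold are hypothetical):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)

# Hypothetical baseline (training-time) and live (inference-time) feature samples.
baseline = {
    "amount":      rng.normal(100, 20, 5000),
    "account_age": rng.exponential(365, 5000),
}
live = {
    "amount":      rng.normal(130, 20, 2000),   # drifted: e.g., an influx of new customers
    "account_age": rng.exponential(360, 2000),  # roughly unchanged
}

P_VALUE_THRESHOLD = 0.01  # illustrative; tune for your alerting tolerance

for feature in baseline:
    statistic, p_value = ks_2samp(baseline[feature], live[feature])
    drifted = p_value < P_VALUE_THRESHOLD
    print(f"{feature:>12}: KS={statistic:.3f}, p={p_value:.3g}, drift={drifted}")
```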

4. Collect metadata to properly segment metric behavior

So far, we have covered categories of data to collect for the purpose of creating behavioral metrics. These metrics could be tracked and analyzed at the global level. However, to truly realize the value of monitoring, behavioral metrics have to be looked at for subsegments of model runs.

For a somewhat trivial example, an ad-serving model might perform consistently overall, but provide gradually poorer recommendations for retirees (as measured by declining click-through rates), balanced by gradually better recommendations for young professionals (as proxied by increasing click-through rates). The AI team would want to understand the behavior for each subpopulation and take corrective actions as necessary.

The crucial enabler of segment-based analysis of the behavior is comprehensively collecting contextual metadata about the model runs. This contextual metadata often already exists in the AI system but is not used as model features.

Here are a couple of additional examples of the value in metadata driven segmentation:

  • Compliance assessment: A bank would like to ensure that its underwriting model is not biased towards (or against) specific genders or races. Gender and race are not model features, but they are nevertheless important dimensions along which to evaluate model metrics and ensure the model complies with lending regulations.
  • Root cause analysis: A marketing team detects that there is a subpopulation of consumers for whom the recommendation model is less effective. Through metadata-driven segmentation, they are able to correlate these consumers with a specific device and browser. Upon further analysis, they realize that there is a defect in the data ingestion process for this particular device and browser combination.
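A minimal sketch of metadata-driven segmentation (the metadata fields device and browser, and the clicked outcome column, are hypothetical): compute the same behavioral metric per segment and surface the worst-performing ones.

```python
import pandas as pd

# Hypothetical inference log enriched with contextual metadata that is not
# used as model features (device, browser), plus a delayed outcome (clicked).
log = pd.DataFrame({
    "device":  ["ios", "ios", "android", "android", "android", "web"],
    "browser": ["safari", "safari", "chrome", "webview", "webview", "chrome"],
    "clicked": [1, 0, 1, 0, 0, 1],
})

# Click-through rate per (device, browser) segment, with segment sizes.
segments = (log.groupby(["device", "browser"])["clicked"]
               .agg(ctr="mean", n="size")
               .sort_values("ctr"))
print(segments)

# The lowest-CTR segments are candidates for root cause analysis, e.g., a
# data ingestion defect affecting one specific device/browser combination.
```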

A note about model versions

Another prominent example of metadata that is helpful to track is the model version (and the versions of other components in the AI system). This enables correlating deteriorating behaviors with the actual changes made to the system.

“The crucial enabler of segment-based analysis of the behavior is comprehensively collecting contextual metadata about the model runs.”

5. Track data during training, test and inference time

Monitoring comprehensively at inference time can yield immense benefits. Nevertheless, for even deeper insights into the AI system, forward-thinking teams expand the monitoring scope to include training and test data. When a model underperforms at inference time, being able to compare the feature distributions for that segment of data with the corresponding distributions at training time can provide the best insight into the root cause of the change in behavior.

If possible, we highly recommend tracking the same metadata fields discussed above when logging training runs as well. By doing so, teams can compare corresponding segments of the data and get to the source of issues faster and more accurately.
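If the same metadata fields are indeed logged at training time, the comparison can be restricted to the exact segment that is misbehaving. A sketch, with a hypothetical device field and amount feature:

```python
import pandas as pd

# Hypothetical logs that share the same metadata field ("device") and feature ("amount").
training_log = pd.DataFrame({
    "device": ["ios", "android", "android", "web"],
    "amount": [90.0, 110.0, 95.0, 120.0],
})
inference_log = pd.DataFrame({
    "device": ["android", "android", "android"],
    "amount": [180.0, 175.0, 190.0],
})

SEGMENT = "android"  # the underperforming segment identified via monitoring

train_seg = training_log.loc[training_log["device"] == SEGMENT, "amount"]
infer_seg = inference_log.loc[inference_log["device"] == SEGMENT, "amount"]

# Compare the feature's distribution for this segment between training and inference.
summary = pd.DataFrame({"training": train_seg.describe(),
                        "inference": infer_seg.describe()})
print(summary)
```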

Summary

Evaluating the performance and behavior of complex AI systems in production is challenging. A comprehensive monitoring strategy could make a real difference.

In our experience, such a monitoring strategy includes defining model performance metrics (e.g., precision, AUC/ROC and others) using data available at the inference stage or even later, establishing granular behavioral metrics of model outputs, tracking feature behavior individually and as a set, and collecting metadata that can assist in segmenting metric behavior.

It is advisable to expand the monitoring scope to the training and test stages to get the full picture of the state of the system and more quickly isolate the root causes of issues.

The best performing AI teams are already implementing similar monitoring strategies as an integral part of their AI lifecycle. These teams experience less anxiety about potential production issues, and better yet, are able to extend their research into production and dramatically improve their models over time.
