
To Monitor or Not to Monitor a Model – Is there a question?

This is the first part of a series of blogs on model monitoring. This article outlines the need to monitor a model and demonstrates two examples of varying complexity using the open-source library Evidently AI.

Notes from Industry, MODEL MONITORING GUIDE

Devanshi Verma, Matthew Fligiel, Rupa Ghosh*, Dr. Arnab Bose

*Contributed equally to this work, names are arranged in alphabetical order.

What is Model Monitoring, and Why Do We Care?

So you’ve finally finished it – you’ve found the perfect hyperparameters, set up your pipelines, done plenty of tests, and put your Machine Learning model in production! The process is over, and it’s time to stop thinking about your model, right? If only it were that easy…

Business operations that are guided by machine learning models rely on the assumption that the distribution of the data underlying the model is not drifting (i.e., the data distribution is the same at training and inference time), so that the model remains as valid as it was immediately after training. To verify that this is the case, machine learning models in production need to be monitored so that their performance (accuracy, memory usage, CPU usage) stays within the range of expectations set during training and testing.

When a model starts to drift, it will likely go undetected until it adversely affects business operations. For example, in healthcare, disease prediction accuracy can start to deteriorate such that preventive measures (based on the prediction) become ineffective. In retail, coupons sent to customers who are expected to use them (based on model prediction) remain unused. All these impact business operations and revenue. An example of a distribution shift is illustrated in Figure 1.

Figure 1: Distribution shift. (Image by authors)

How to Monitor Models?

There are multiple methodologies used to monitor models – some key techniques, such as prior probability shift and covariate shift, are outlined in [1]. Furthermore, there are multiple open-source and commercial software implementations of model monitoring methodologies. In this article, we walk through an open-source model monitoring library, Evidently AI [2].

Evidently AI

Evidently AI is an open-source Python package that allows for the creation of dashboards that compare training and production datasets. It easily integrates into Python code to detect data drift. We want to investigate its capabilities for model monitoring, as the cost (free!) and ease of installation (from PyPI) make it an attractive candidate for use.

How Evidently AI Works

Evidently AI works by analyzing the training and production datasets. It maps the data from features in the training data to their counterparts in the production data. For example, for the weather model discussed later in this article, it maps the training data from St. Louis to the production data from St. Louis. Thereafter it runs different statistical tests depending on the input. Evidently AI then creates graphs based on the Plotly Python library, and you can read more about the code in their open-source GitHub repository.

For binary categorical features, it performs a simple Z-test for a difference in proportions to verify whether there is a statistically significant difference in how often the training and production data take each of the two values. For categorical features with more than two categories, it performs a chi-squared test, which checks whether the distribution of the variable in the production data is plausible given the distribution in the training data. Finally, for numeric features, it performs a two-sample Kolmogorov-Smirnov goodness-of-fit test, which assesses whether the distributions of the feature in the training and production data are likely to be the same or differ significantly.
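To make these checks concrete, here is a minimal sketch of the three kinds of tests described above, written directly with SciPy and statsmodels on synthetic data. It illustrates the statistics involved, not Evidently AI's internal implementation.

```python
# Minimal sketch of the three kinds of drift tests, using scipy/statsmodels
# on synthetic data (not Evidently AI's internal code).
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)

# Binary categorical feature: Z-test for a difference in proportions.
train_binary = rng.binomial(1, 0.30, size=1000)  # 30% positives in training
prod_binary = rng.binomial(1, 0.45, size=1000)   # 45% positives in production
counts = np.array([train_binary.sum(), prod_binary.sum()])
nobs = np.array([len(train_binary), len(prod_binary)])
_, p_binary = proportions_ztest(counts, nobs)

# Multi-category feature: chi-squared test of production counts against
# the frequencies expected from the training distribution.
train_counts = np.array([500, 300, 200])
prod_counts = np.array([350, 380, 270])
expected = train_counts / train_counts.sum() * prod_counts.sum()
_, p_categorical = stats.chisquare(f_obs=prod_counts, f_exp=expected)

# Numeric feature: two-sample Kolmogorov-Smirnov test.
train_numeric = rng.normal(loc=10, scale=3, size=1000)
prod_numeric = rng.normal(loc=14, scale=3, size=1000)
_, p_numeric = stats.ks_2samp(train_numeric, prod_numeric)

print(p_binary, p_categorical, p_numeric)  # small p-values signal drift
```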

In all of the above-described cases, Evidently AI flags statistically significant results. This flag indicates that something has changed in distribution between the training data and the inference data used to make predictions in the real world.

Our Examples

In this article, we demonstrate how Evidently AI works with two examples of varying complexity. In the first example, a model uses the temperatures of five nearby cities (Detroit, Milwaukee, Omaha, Toronto, and St. Louis) to predict the temperature in Chicago. This acts as a simple example of searching for data shifts in tabular numeric data. In this case, the model is a Support Vector Machine – for simplicity, the problem is not approached from a time series perspective (the idea is that the model could predict the temperature of Chicago in real time based on the temperatures of the other cities). This scenario allows for the possibility of drift (as the weather changes seasonally), as well as overall shifts.
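For context, a model along these lines could be set up as in the sketch below, which uses scikit-learn's support vector regressor; the CSV file name and column names are illustrative placeholders rather than the exact code behind this example.

```python
# Illustrative setup of the weather example: predict Chicago's temperature
# from the temperatures of five other cities (file and column names are
# hypothetical placeholders).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.read_csv("city_temperatures.csv")  # hypothetical file
features = ["Detroit", "Milwaukee", "Omaha", "Toronto", "St. Louis"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Chicago"], test_size=0.2, random_state=42
)

model = make_pipeline(StandardScaler(), SVR())
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```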

In the other example, images of plant seedlings are run through a Convolutional Neural Network (CNN) model, with the goal of differentiating sugar beet plants from other similar seedlings. Here, drift is demonstrated through the use of a different dataset in production. This is a non-trivial example, where the data being monitored is not easily expressed in tabular format.

Tabular Data Shift Detection in Evidently AI

One change made to illustrate a data shift is to replace the Toronto data with data from Phoenix, AZ. As expected, the distribution of temperatures (in Celsius) is quite different between the two cities, especially over the dates used in this example (March to early June). Figure 2 is a quick look at a sample from the data before it is changed.

Figure 2: Original weather data. (Image by authors)

The same data after the change is shown in Figure 3. Notice how "Toronto" has significantly higher values than the other cities, as its value has been changed to that of Phoenix!

Figure 3: Weather data after the shift. (Image by authors)

Evidently AI is used to monitor the model – its inputs are a dataframe of the original training data and a dataframe of production data, and it creates an analytics dashboard highlighting any differences.

It identifies the input variables in each of the datasets (here df_old corresponds to the original training data, and df corresponds to the data on which the model is currently predicting). Evidently AI then matches the variables across the datasets and produces a report highlighting the differences. A snapshot of the generated dashboard is shown below in Figure 4.
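A minimal sketch of generating this dashboard is shown below. It assumes the Dashboard/DataDriftTab interface that Evidently AI exposed at the time of writing; the library's API has evolved since, so treat the exact imports and calls as an approximation and consult the current documentation.

```python
# Sketch of building the data drift dashboard (Evidently AI's dashboard API
# at the time of writing; exact imports/calls may differ in newer releases).
import pandas as pd
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab

df_old = pd.read_csv("weather_training.csv")   # reference (training) data
df = pd.read_csv("weather_production.csv")     # current (production) data

drift_dashboard = Dashboard(tabs=[DataDriftTab])
drift_dashboard.calculate(df_old, df, column_mapping=None)
drift_dashboard.save("weather_data_drift.html")  # open the HTML report in a browser
```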

Figure 4: Drift dashboard in Evidently AI. (Image by authors)

It is immediately clear that there is a drift in the distribution for "Toronto", as there are now far more high values. Interestingly, a shift is also detected for Detroit, which was not tampered with. This demonstrates that there is always a chance of a false positive – and even though we did not alter this data, the flag indicates that we should keep an eye on the feature.

Evidently AI also has capabilities for monitoring drift in the outcome variable, i.e., for understanding whether the model's predictions are deviating from what was seen during training. This model output drift is referred to as target shift or, more commonly, prior probability shift: the input data distribution does not change, but the target variable distribution does. Consider credit card defaulters as an example. The characteristics of the population may vary over time, and so may the proportion of defaulters among daily customers, so a model trained last year may no longer be effective today. For example, the prior probabilities may have been 20% defaulters and 80% people who pay their debts, compared to 30% defaulters and 70% who pay now. This so-called "target shift" can have serious business implications for a model in production and should be closely monitored. An example is shown in Figure 5 below.

Figure 5: Prediction drift aka target shift. (Image by authors)

The current dataset shows a shift in the prediction distribution. Further analysis is available showing how the predictions correlate with each of the features. This helps monitor data drift by revealing whether the relationship between the input variables and the outcome has shifted significantly; if it has, the model may no longer be appropriate for production. This offers another way to assess whether the model needs to be retrained, as shown in Figure 6.

Figure 6: Prediction correlation. (Image by authors)

Evidently AI also showcases the prediction values and prediction behavior by feature, as highlighted in Figure 7 below.

Figure 7: Prediction behavior by feature. (Image by authors)

Note the shift in axes – whereas the reference (training) values of Toronto (which actually correspond to the weather in Toronto) range from 0 to 30, the current (test/production) values (actually Phoenix) range from 0 to 40.

Image Data Shift Detection in Evidently AI

In the second example, we investigate data drift in image data. Our model is a multi-class image classification model that takes pictures of seedlings as input and classifies them into one of 12 available classes [3]. The monitoring detects when the production data drifts and includes images of seedlings outside the 12 classes. This data-drift monitoring is done by comparing production data to the model's training data.

Imagine the scenario that our training data is composed of images of sugar beet, but the test/production data contains images of a different type of seedling, say, shepherd's-purse. This is shown in the figures below: Figure 8a shows the training data images and Figure 8b shows the test/production images of shepherd's-purse. Visually it is clear that the training and production datasets are different. Our goal is to detect drift in the features between the two samples using Evidently AI.

Figure 8(a): Sugar-beet images training data. (Image from Plant Seedlings Dataset [3])
Figure 8(b): Shepherd's-purse images test/production data. (Image from Plant Seedlings Dataset [3])

Approach for managing image data with Evidently AI:

As mentioned in the previous example, Evidently AI assesses the distributions of the features in the training and production data to see if they are likely to come from the same distribution or if they differ significantly. Therefore, we need to convert the image data into a format that allows Evidently AI to compare features.

Here is a brief description of the sequence of steps to load the data into Evidently AI. This is shown in Figure 9.

Load Data and prepare for processing:

  1. Read in a set of images used in the training of the ML model.
  2. Read in a set of images from current production.
  3. Resize both sets to an appropriate array size depending on the granularity required. Each image becomes an n-by-n array, where n is the number of pixels per side, giving n * n features per image.

Feature Engineering:

  1. Create an array of n * n features from the pixel values. As mentioned above, the pixels are the features of the image.

Load EvidentlyAI, run the Data Drift module, and generate data drift dashboards.

Figure 9: Steps for checking image data drift. (Image by authors)
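As an illustration of these steps, the sketch below resizes each image to n-by-n grayscale and flattens it into a row of pixel features. The folder paths, file pattern, and choice of grayscale conversion are assumptions made for this example, not something prescribed by Evidently AI.

```python
# Sketch: turn image folders into tabular "pixel features" that can be
# compared by Evidently AI (paths and grayscale conversion are our choices).
from pathlib import Path

import numpy as np
import pandas as pd
from PIL import Image

def images_to_feature_frame(folder, n=50):
    """Resize each image to n-by-n grayscale and flatten it into n*n features."""
    rows = []
    for path in sorted(Path(folder).glob("*.png")):
        img = Image.open(path).convert("L").resize((n, n))
        rows.append(np.asarray(img).flatten())
    # One column per pixel: feature_0 ... feature_{n*n - 1}
    columns = [f"feature_{i}" for i in range(n * n)]
    return pd.DataFrame(rows, columns=columns)

reference = images_to_feature_frame("data/train/sugar_beet")    # training images
current = images_to_feature_frame("data/production/seedlings")  # production images
```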

An example of the original image of a sugar beet seedling is shown below in Figure 10, followed by the 50-by-50 resized representation of the same image.

Figure 10: Original image and re-sized image. (Image by authors)

What is the appropriate value of "n", which determines the number of features for these images? We need to choose a number of features appropriate for the type of images we are dealing with. We experimented with a few different array sizes and found that, for this specific example, resizing the images to 5-by-5 is adequate for Evidently AI but does not visually illustrate the characteristics of the image. Therefore, we opted for a 50-by-50 resize, where the characteristics are still visible.

Evidently AI compares the distributions of each feature and reports a p-value to indicate statistical significance. A p-value below 0.05 is considered statistically significant drift and can be used to decide on model retraining.

Detecting Data Drift using Evidently AI

While Evidently AI can tell us which features have drifted, the feature names are largely arbitrary and come from the process of preparing the image data for Evidently AI. So, how can we visually assess which features/pixels showed drift? Luckily, Evidently AI's helpful GitHub repository has relevant code to identify the features (in the matrix of features) where data drift is detected. As shown in Figure 11, drift is either TRUE or FALSE.

Figure 11: Feature drift matrix report. (Image by authors)

To re-create an image from the above TRUE/FALSE values, we simply convert them to a NumPy array, reshape it to the original n-by-n size, and display the image. This step helps determine where the drift occurs and what significance it has for our model.
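A small sketch of this reconstruction is below; `drift_flags` stands in for the per-feature TRUE/FALSE values extracted from the Evidently AI report (here filled with placeholder values so the snippet runs).

```python
# Reshape the per-feature TRUE/FALSE drift flags into an n-by-n image so
# drifted pixels can be inspected visually.
import numpy as np
import matplotlib.pyplot as plt

n = 50
# Placeholder for the booleans extracted from the Evidently AI report,
# one per pixel feature, in the same order as the flattened pixels.
drift_flags = np.random.rand(n * n) < 0.025

drift_mask = np.array(drift_flags, dtype=bool).reshape(n, n)
plt.imshow(drift_mask, cmap="gray")
plt.title(f"Pixels with detected drift: {drift_mask.sum()} of {n * n}")
plt.axis("off")
plt.show()
```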

In experiment 1 we compare sugar-beet images in training with sugar-beet images in test/production and expect minimal drift. Here, this is generally the case – there is data drift, but it occurred in only 2.5% of the features. The following image shows the pixels where drift is detected. Overall, they seem to be fairly randomly distributed across the image and do not point to any broader patterns. This is seen in Figure 12.

Figure 12: Visualizing features where DRIFT=TRUE in experiment 1. (Image by authors)

In experiment 2 we compare sugar-beet images with shepherd's-purse images in test/production and expect to see drift in a large number of pixels (since the shape of the leaves of these seedlings is very different). True to our expectations, here is the plot of the 907 pixels in which we detect drift! Furthermore, the feature shift is broadly distributed throughout the image, with large patches of the image seeing significant shifts. This is seen in Figure 13.

Figure 13: Visualizing features where DRIFT=TRUE in experiment 2. (Image by authors)

Plotting the above shows the spread of the drift and helps us determine whether it is significant and whether the model should be retrained.

Evidently AI Data-drift dashboard output

Evidently AI provides a second representation of data drift in a tidy dashboard that compares every feature between the training and current/production datasets, as seen in Figure 14.

Figure 14: Snippet from data-drift dashboard for experiment 1. (Image by authors)

Drilling down into one of the features where there is no drift detected, we see the following distribution of data in Figure 15.

Figure 15: Drill down from data-drift dashboard for feature 1607. (Image by authors)

For the next two sets of images where we compare images of sugar beet with images of shepherd’s-purse, we see that Evidently AI detected data drift in 907 features out of 2500, as shown in Figure 16 – a whopping 36% of the features! We can immediately tell that these images are fairly different from the ones used in training.

Figure 16: Snippet from data-drift dashboard for experiment 2. (Image by authors)

A p-value is shown for the distribution comparison of each feature. Drilling down into a feature where data drift is detected shows the underlying distributions, as seen in Figure 17.

Figure 17: Drill down of feature 1373 in data-drift dashboard for experiment 2. (Image by authors)

Categorical Target drift (Prior Probability Shift)

To evaluate prior probability shift for our categorical target variable, Evidently AI uses chi-squared tests (as mentioned above) to assess whether the variable comes from the same distribution in both datasets. Here's how such a dashboard can be created in Evidently AI:
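The sketch below assumes the Dashboard/CatTargetDriftTab interface available at the time of writing, and that both dataframes carry the class labels in a column named `target` (here, the reference and current feature frames built earlier with a label column added); the exact calls may differ in newer releases of the library.

```python
# Sketch of a categorical target drift dashboard (Evidently AI's dashboard
# API at the time of writing; exact imports/calls may differ today).
from evidently.dashboard import Dashboard
from evidently.tabs import CatTargetDriftTab

# reference/current: dataframes of features plus a 'target' column of labels
target_drift_dashboard = Dashboard(tabs=[CatTargetDriftTab])
target_drift_dashboard.calculate(reference, current, column_mapping=None)
target_drift_dashboard.save("target_drift.html")
```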

Figure 18: Distribution of the target in the current (test/production) and reference (training) dataset. (Image by authors)

As you can see in Figure 18, the dashboard output contains the p-value of the test, along with the class proportions for the reference (training) and current (test/production) datasets, which appear when you hover over the graphs. Some of the key drivers of the target drift appear to be the vastly different proportions of maize and cleavers between the training and production datasets.

Model Performance

Evidently AI also has a classification model performance feature, which supports both binary and multi-class classification. To run this report, we need the input features along with the actual and predicted target values. The tool currently supports tabular data, and the team plans to add more functionality, including support for image and text data. Here's how such a report can be created in Evidently AI:
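A sketch of such a report is below, again assuming the Dashboard-style interface of the version we used and the library's default expectation of `target` and `prediction` columns; the exact imports and calls may differ in newer releases.

```python
# Sketch of a classification performance report (Evidently AI's dashboard
# API at the time of writing; exact imports/calls may differ today).
from evidently.dashboard import Dashboard
from evidently.tabs import ClassificationPerformanceTab

# reference/current: dataframes with input features plus 'target' (actual
# label) and 'prediction' (model output) columns.
performance_dashboard = Dashboard(tabs=[ClassificationPerformanceTab])
performance_dashboard.calculate(reference, current, column_mapping=None)
performance_dashboard.save("classification_performance.html")
```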

Here is part of the sample output from the image data:

Figure 19: Performance metrics for reference (training) and current (test/production) dataset. (Image by authors)

As seen in Figure 19, to assess model quality the Evidently AI dashboard shows macro-averaged metrics such as F1 score, precision, recall, and accuracy, which are helpful for a multi-class problem.

Figure 20: Distribution of class (in absolute numbers) in reference (training) and current (test/production) data. (Image by authors)
Figure 21: Confusion matrix for reference (training) and current (test/production) data. (Image by authors)
Figure 22: Quality metrics for reference (training) and current (test/production) data. (Image by authors)

As seen in Figures 20, 21, and 22, Evidently AI also provides a visual representation of the class distribution in the datasets, confusion matrices, and quality metrics for each class. As in the case above, the interactive dashboard is created using Plotly. A report like this is very helpful for analyzing model runs, running A/B tests, debugging, and determining appropriate thresholds for possible retraining.

Conclusion

In conclusion, the output from Evidently AI can be used to identify all the features that demonstrate a data drift. Whether or not we need to re-train the ML models based on these features depends entirely on the business use case and how critical these features are to the model’s predictions or classification.

As for Evidently AI itself – it performs its role admirably given it is an open-source platform. It is able to identify cases of data drift in both examples investigated. However, it is designed more for a spot-check analysis of data drift than for an ongoing monitoring system. Evidently AI is very useful for models that may not be generating output continuously, or for teams with few models in production.

One topic we plan to address in later blogs is concept shift, a type of drift that is harder to catch through standard model monitoring techniques. Concept shift is a change in the model design or target variable assumptions, or in the fundamental relationship between the dependent and independent variables. Examples include cases where the model's predictions are no longer of importance or relevance to the business, or where a newly available, important independent variable is not accounted for by the model. While Evidently AI does not catch these types of issues, no model monitoring platform really can, as concept shift generally must be identified by the business owners, depending on the role the model plays for the business.

You can view the complete code at https://github.com/thisisdevanshi/MLOps-Blog.

References

[1] https://www.kdnuggets.com/2021/01/mlops-model-monitoring-101.html

[2] https://evidentlyai.com/

[3] Giselsson et al., A Public Image Database for Benchmark of Plant Seedling Classification Algorithms (2017), arXiv preprint arXiv:1711.05458.
