Legend has it that in the early 2010s it was sufficient for data scientists to master Pandas and Scikit-Learn in their Jupyter Notebooks to excel in the field. Nowadays expectations are higher, and data scientists often need to navigate the entire Machine Learning lifecycle. This includes monitoring ML models in production. Much has been said and written about the different metrics that can be used to check a model’s health. Yet a fundamental question has often been left open, namely when to intervene.
Modifying a malfunctioning model too late can have a disastrous impact on the business. Modifying it too early, however, can lead to unnecessary costs (in time and money) and overfitting. To balance early and late interventions, it is crucial for data scientists to have a sound strategy for when to step in.
In this post, we explore how to detect a substantial change in a model’s performance based on the latest research in time series analysis.
Note: The proposed methodology is inspired by ‘retrospective’ change point analysis (see [1]). This means we try to detect changes in a historical data set (in this case the model’s performance). If this methodology is used for ‘online’ monitoring, it could result in multiple testing and needs to be adapted accordingly. For online monitoring, methods based on online change point detection are more suitable.
A decrease in the quality of a model can have various causes. For example, we might update it as new data comes in, but the incoming data could be biased. Or we do not update the model, but there is a shift in the distribution of the underlying data-generating process. Whatever the source of the model’s deterioration, we want to detect it and react in time.
To keep things simple, let’s consider a classifier and take accuracy as a measure of its performance. We might train the model and reach an accuracy of 90% on a holdout dataset. When deploying the model into production, we could define a hard threshold, such as 80%, and react when the model’s accuracy falls below it. Even though this seems to be a conveniently simple approach, it has some fundamental drawbacks.
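As a minimal sketch (with an illustrative threshold value and a hypothetical check_model function), such a hard-threshold rule could look like this:

```python
THRESHOLD = 0.80  # hard threshold on the daily in-sample accuracy (illustrative value)

def check_model(accuracy_today: float) -> None:
    """Warn as soon as the measured daily accuracy falls below the hard threshold."""
    if accuracy_today < THRESHOLD:
        print(f"Accuracy {accuracy_today:.1%} is below {THRESHOLD:.0%}: intervention suggested.")

check_model(0.78)  # hypothetical daily in-sample accuracy
```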
First, we do not observe the real (or out-of-sample) accuracy of the model, but the in-sample accuracy. The distinction between these two concepts is crucial. The out-of-sample accuracy is the (unknown) accuracy of the model on the whole population. What we observe, based on the incoming data, is the in-sample accuracy. The in-sample accuracy can be interpreted as the out-of-sample accuracy plus some noise. Reacting immediately when the in-sample accuracy crosses the 80% threshold might be an overreaction.
Second, it is not clear which data to use for the accuracy evaluation. Do we measure the model’s performance based on all data that was not used for training or should we only use the incoming data? If we only use the incoming data, how far should we go back? Should we use yesterday’s data or consider a longer time frame? The more data we use, the later we detect changes in the model’s performance. The less data we use, the higher the variance of the accuracy. In particular, a hard threshold is sensitive to outliers.
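To illustrate the variance side of this trade-off, the following sketch (with simulated labels and predictions of a roughly 90%-accurate classifier; the data and window sizes are hypothetical) compares the measured accuracy over evaluation windows of different lengths:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated labels and predictions of a classifier that is correct ~90% of the time.
y_true = rng.integers(0, 2, size=1000)
y_pred = np.where(rng.random(1000) < 0.9, y_true, 1 - y_true)

# The shorter the evaluation window, the noisier the measured in-sample accuracy.
for window in (25, 100, 1000):
    acc = (y_true[-window:] == y_pred[-window:]).mean()
    print(f"in-sample accuracy over the last {window} observations: {acc:.3f}")
```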
Suppose that we measured the accuracy of the model based on the daily incoming data (as in Figure 1). A hard threshold of 80% would suggest an intervention after only 7 days, which seems to be an overreaction, as the real (but unknown) accuracy still lies above the threshold.

To avoid this trade-off between bias and variance, as well as the lack of robustness to outliers, it might be useful to take another perspective.
Let’s say we want to monitor the classifier’s performance on a daily basis over T days, starting at day t=1. More specifically, let A(t) denote the in-sample accuracy of the classifier on the data at day t. Thus, we have a time series A(1), …, A(T) of in-sample accuracies. We can then split the in-sample accuracy into a deterministic part (the out-of-sample accuracy) a and a random part ε with mean 0, which is introduced by the noise in the data. In particular, we have A(t) = a(t) + ε(t) with E[ε(t)] = 0, for t = 1, …, T. Note that the random errors ε can be dependent. We now want to determine whether the deviation of the out-of-sample accuracy from the accuracy on the holdout dataset from training is relevant. In other words, we want to test the hypotheses

H₀: |a(t) − a(0)| ≤ Δ for all t = 1, …, T   versus   H₁: |a(t) − a(0)| > Δ for some t = 1, …, T.
Here a(0) denotes the model’s accuracy before deployment and Δ is a threshold that specifies if a deviation is relevant.

In our example, a(0) = 90% and we might want to allow a 10% deviation, thus Δ = 10%. Since a(t) is unknown, we must estimate it in order to test the above hypotheses (see Fig. 2).
There are various approaches to estimate a, such as quantile regression or local polynomial estimation. Let ã denote the local linear estimator of a with bandwidth h(T) (see [2]). We can then reject the null hypothesis whenever

max_{t=1, …, T} |ã(t) − a(0)| > Δ.
As T grows to infinity, the probability of detecting an actual change converges to 1, and the probability of falsely detecting a change vanishes. In other words, we intervene "too early" with low probability and intervene with high probability when it is necessary.
Note that the above testing procedure could be refined to obtain an asymptotically consistent level-α test (implying non-trivial convergence). However, in this case we would need to estimate certain parameters and tune hyperparameters, which is not practical in applications.
If the null hypothesis is rejected, we can go ahead and estimate the time when the model’s accuracy started to deviate by more than 10% from the initial accuracy of 90%. This might be useful to gain additional insight into why the model’s quality changed. A straightforward estimator is the earliest time s such that |ã(s) − 90%| > 10%. In the example of Figure 2, the estimated time is s=17, which is close to the real time t=16 when the model’s accuracy dropped below 80%.
For an implementation of the monitoring procedure in Python, one can use the kernel regression from statsmodels (see the code below). Note that in applications the in-sample accuracies would come from an external source; here they are only simulated to demonstrate the method. The second code snippet creates a plot similar to Figure 2 (without the unknown out-of-sample accuracy).
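The following is a sketch of such an implementation, not the exact original snippet: the daily in-sample accuracies are simulated with a drop after day 15, a(0) = 90% and Δ = 10% as in the example above, and the bandwidth of the local linear estimator is chosen by least-squares cross-validation rather than a prescribed h(T).

```python
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

rng = np.random.default_rng(0)

# Simulated daily in-sample accuracies A(1), ..., A(T); in an application these
# values would come from an external monitoring source.
T = 30
t = np.arange(1, T + 1, dtype=float)
a_true = np.where(t < 16, 0.90, 0.75)     # out-of-sample accuracy with a drop at day 16 (unknown in practice)
A = a_true + rng.normal(0, 0.02, size=T)  # observed in-sample accuracy

a0 = 0.90     # accuracy on the holdout dataset before deployment
delta = 0.10  # threshold for a relevant deviation

# Local linear estimate of the out-of-sample accuracy a(t):
# reg_type="ll" selects local linear regression, bw="cv_ls" chooses the
# bandwidth by least-squares cross-validation.
kr = KernelReg(endog=A, exog=t, var_type="c", reg_type="ll", bw="cv_ls")
a_hat, _ = kr.fit(t)

# Reject H0 if the estimated accuracy deviates from a0 by more than delta at some day,
# and report the earliest day with a relevant deviation.
deviations = np.abs(a_hat - a0)
if deviations.max() > delta:
    s = int(t[np.argmax(deviations > delta)])
    print(f"Relevant deviation detected; estimated change at day {s}.")
else:
    print("No relevant deviation detected.")
```

A plot similar to Figure 2 (without the unknown out-of-sample accuracy) could then be created along these lines:

```python
import matplotlib.pyplot as plt

# Observed in-sample accuracies, their local linear estimate and the
# relevance threshold around the initial accuracy a0.
plt.figure(figsize=(8, 4))
plt.plot(t, A, "o", label="in-sample accuracy A(t)")
plt.plot(t, a_hat, "-", label="local linear estimate")
plt.axhline(a0, linestyle="--", color="grey", label="initial accuracy a(0)")
plt.axhline(a0 - delta, linestyle=":", color="red", label="relevance threshold")
plt.xlabel("day t")
plt.ylabel("accuracy")
plt.legend()
plt.tight_layout()
plt.show()
```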
Conclusion
Monitoring ML models in production might not be as satisfying as training new models from scratch, but it is fundamental to detect malfunctioning processes and to ensure high quality.
Finding the right time to act is crucial to prevent unnecessary costs, overfitting of the model and devastating consequences of wrong predictions. The out-of-sample accuracy can be estimated through local linear estimation, and a hypothesis test for relevant deviations from the initial accuracy can easily be implemented in Python.
As the automation of the ML lifecycle advances, we need to make sure to include the monitoring of deployed models in this process in order to take the quality of our ML pipelines to the next level.
[1] A. Bücher, H. Dette and F. Heinrichs, Are deviations in a gradually varying mean relevant? A testing approach based on sup-norm estimators (2020), arXiv preprint arXiv:2002.06143.
[2] H. Dette and W. Wu, Detecting relevant changes in the mean of nonstationary processes – A mass excess approach (2019), The Annals of Statistics, 47(6), 3578–3608.