
Monitoring Machine Learning Models: A fundamental practice for data scientists and machine learning…

A beginner's guide on monitoring machine learning models

Photo by Nathan Dumlao on Unsplash

Machine learning falls under the umbrella of Artificial Intelligence. It focuses on creating and developing algorithms that can analyze data, draw conclusions, and make predictions. Organizations in the banking and financial services industries can use the many insights obtained from machine learning to help them make decisions in the future. An illustration of this is a machine learning model that assigns risk scores to vehicle loan applicants by utilising various data sources from client applications. Such a model can then readily flag the clients who are at risk of defaulting on their vehicle loans, helping banks rethink or adjust the terms for each customer.

Why is it vital to monitor loan defaulters?

A loan is called a "bad loan" when the borrower fails to make payments in accordance with the loan’s terms. This has a detrimental effect on the bank’s profitability and may also cause the lender to incur credit losses. A large number of subprime loans can erode a bank’s capital adequacy and, in the worst-case scenario, result in default. Bad loans also risk hindering long-term economic growth and increasing uncertainty in the banking system, both of which heighten the risk of financial instability.

Why is machine learning model monitoring crucial for these use cases?

Let’s imagine you have a machine learning model that determines whether or not a candidate is qualified for a loan, and you believe you can rely on it to forecast correctly nearly all of the time. Using its predictions spares your firm a lot of guesswork, and fortunately it works well. It’s tempting to believe that the model will continue to function properly, yet this is untrue.

Because the world surrounding machine learning models is always changing, these models can and will degrade over time. Your model will produce a lot of incorrect predictions if you don’t frequently feed it relevant and updated data, and you won’t even be aware of it. Over time, these mistakes can easily accumulate, causing your bank to incur heavy losses. If you want to safeguard your company and yourself from these issues, it’s crucial to start monitoring your machine learning models and to keep doing so.

Most models generate predictions from a dataset, and the environment around the model changes with time. Your model won’t produce accurate predictions if the data you used to train it is out of date. These problems arise on top of the system bugs and technical issues you would normally expect from any software system.

What are the many ways that machine learning models could fail?

Although model failure can have many causes, the most frequent one is model drift.

What is model drift?

Model drift is the term used to describe how a model’s accuracy degrades as a result of changing data and relationships between its input and output variables.

Image created by author

What causes model drift?

Although there are several causes for model drift, they can be divided into two basic groups:

Poor training data:

  • Biased sample selection – This happens when a biased or ineffective strategy was used to gather or prepare the training data, so the data does not accurately represent the operational environment in which the model will be used.
  • Alterations in hidden variables – Although hidden variables cannot be directly measured, they have a significant impact on some of the observed variables. In essence, if hidden factors change, so will the data we can see. For example, we might consider the car’s value, the borrower’s income, their history with past car loans, the size of the down payment, and so on. A person’s overall circumstances, however, may have an even greater impact; say, a family member is battling a disease and needs emergency funds for treatment. Although such factors cannot be measured directly, they have a crucial impact on the number of loan defaulters (our target).

Environmental changes:

  • Dynamic environment – This is the most straightforward and obvious instance of instability, where the change in the data and relations is beyond our control. For instance, rules and regulations may change, user interests may change, better rivals may emerge, etc.
  • Technical issues – A malfunctioning data pipeline, a change in the value of one of the feature’s parameters, or even a bug could be to blame for these problems.
  • Shift in domain – This refers to modifications in the significance of concepts or values. For instance, as money loses value due to inflation, the price of an item or a person’s salary will have varying consequences over time.

Drift patterns

If we consider the word’s definition, "drift" essentially refers to a progressive shift through time. Similarly, drift in machine learning also occurs at varying rates. Following are the different types of drift:

  • Gradual – Over time, as new concepts emerge, a gradual transformation takes place. For instance, consider a car price prediction model that was developed in 2015 and had a high accuracy level at the time. Because car prices rise over time, after a few years the predictions’ validity and accuracy start to decline. This makes sense, since we all know that car prices generally rise over time; however, if it is not taken into account, it can negatively affect a model’s accuracy.
Gradual drift (Image created by author)
  • Sudden – A drift may occur suddenly. For instance: Abrupt adjustments in purchasing patterns and consumer behaviour during and after a pandemic.
Sudden drift (Image created by author)
  • Recurrent – In this, the changes recur after the initial observation, or we might say that it occurs periodically. Take winter clothing shopping as an example.
Recurrent drift (Image created by author)
  • Spike – These are exceptional occurrences that could have an impact on the model. For instance: Alterations brought on by a war, pandemic, recession, etc.
Spike drift (Image created by author)

Model drift can be further classified into two main categories:

Data drift

Data drift is a shift in the data’s distribution. For production machine learning models, it is the difference between the real-time production data and a baseline or reference data set (usually the training set) that is representative of the task the model is designed to carry out. Due to changes in the real world, production data may depart from the baseline data over time. Data drift can be further classified into two categories:

  • Covariate/feature drift

Covariate drift happens when the data used to train an algorithm diverges from the data it receives in production. This means that while the relationship between the features and the target variable has not changed, the distribution of the features themselves has changed. When the statistical properties of the input data change, the model that was built on the old distribution no longer provides unbiased results and therefore makes inaccurate predictions. For instance, a manufacturer of medical devices might use data from sizable urban hospitals to create its machine-learning-based system. However, once the product is on the market, the medical data entered into the system by healthcare professionals in rural areas might not resemble the development data. In some sociodemographic groups, patients from urban hospitals may be more likely to have underlying medical issues than those from rural hospitals. These discrepancies won’t be noticed until after the product has been released to the market and starts to malfunction more frequently than it did during testing.

  • Label drift

This sort of drift happens when the distribution of the class variable (y), the model’s output, or the label distribution changes. For instance, the pandemic resulted in a considerable increase in automobile costs, which shifted the distribution of car prices towards higher values. An ML model developed to predict car prices before the pandemic won’t be able to predict with sufficient accuracy after it.

Label drift (Image by author)

Concept drift

Concept drift has occurred when p(y|X) changes but p(X) stays the same. Here, p(X) and p(y) represent the probability of observing car features X and car prices y, respectively, and p(y|X) represents the conditional distribution of car prices given car features. In the car price prediction example, the conditional probability of the price given the car’s attributes, p(y|X), could change. Suppose the distribution of the car’s seating capacity doesn’t change, but customers now want larger cars, so prices for larger cars have increased. Particularly for larger cars, the conditional probability of the price given the seating capacity may then change.

Concept drift (Image created by author)

How to detect drift?

There are several ways to detect drift, the most common of which are:

Tracking model performance:

  • Tracking model performance metrics is the simplest method for spotting drift. The confusion matrix, accuracy, recall, F1 score, and ROC-AUC are some of the most popular performance indicators for ML models. Other metrics that capture model behavior may also be crucial, depending on how we use the model.

Tracking descriptive statistics:

  • Another straightforward approach is to track descriptive statistics of the input features and predictions over time, such as the mean, median, variance, and quantiles, and to compare them against the same statistics computed on the training data. A noticeable shift in these numbers is an early warning sign of drift.

Now that we’ve seen some potential machine learning model failure modes and how to spot them, let’s look at how we can prevent them using NannyML!

How to avoid model failure using NannyML?

What is NannyML?

NannyML is an open-source Python library that allows you to estimate post-deployment model performance (without access to targets), detect data drift, and intelligently link data drift alerts back to changes in model performance.

In the example that follows, we will demonstrate how to use NannyML and the necessary code to prevent model failure.

Let’s install and import the necessary packages:
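A minimal sketch of that setup (package versions are up to you):

```python
# Install NannyML from PyPI (run once, e.g. in a notebook cell or terminal)
# pip install nannyml

import pandas as pd
import nannyml as nml
```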

Let’s load the data now that we have all the essential packages. This is the dataset I’ll be working with.
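As a rough sketch, loading could look like this (the file name below is a placeholder for wherever you keep the car-loan data):

```python
# Placeholder path -- point this at your copy of the car-loan dataset
data = pd.read_csv("car_loan_data.csv", parse_dates=["timestamp"])
data.head()
```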

Let’s look at the use case before continuing!

This dataset focuses on car loans, and our goal here is to determine whether or not the borrowers will be able to repay their debt. Every row in the dataset represents a customer, and we have details on each one’s loan application, such as the value of the car, their expected monthly income, and whether they’ve paid off the loan on a previous car. Most significantly, a portion of this dataset (the rows with 0s and 1s in the "repaid" column) records whether the customer was able to repay their car loan, and this portion was used to train the machine learning model. We can also assess the performance of this model because, along with the predicted output (the "y_pred_proba" column), we also have the actual output (the "repaid" column). A column called "partition" distinguishes between "reference" and "analysis" data; the data extracted after the deployment of the machine learning model is referred to as the "analysis" data. The actual date and time a prediction was made are listed in the "timestamp" column.

As we can see, in this particular use case we can only obtain the target data after a certain amount of time; we can only determine whether the customers were able to repay their debts after, say, a few months or a year. This makes evaluating the effectiveness of the machine learning model quite challenging. And that’s what we’re going to use NannyML for: to see how well our model performs after deployment without the target data.

Let’s divide our data into "reference" and "analysis" since NannyML needs to learn about a model from a reference dataset before it can monitor the data that is actually being analyzed, which is provided as the analysis (post-deployment) dataset:
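Since the dataset already carries a "partition" column, a simple filter is enough (sketch):

```python
# Split the dataset using the existing "partition" column
reference = data[data["partition"] == "reference"].copy()
analysis = data[data["partition"] == "analysis"].copy()
```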

Using the training ("reference") data, we will now estimate performance for our post-deployment data ("analysis").
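A sketch of that step is shown below; the chunking choice and the predicted-class column name are assumptions, and parameter names may vary slightly across NannyML versions:

```python
# Initialize the CBPE estimator with the model's input/output columns
estimator = nml.CBPE(
    y_pred_proba="y_pred_proba",       # predicted probability column
    y_pred="y_pred",                   # assumed name of the predicted-class column
    y_true="repaid",                   # ground truth, available on the reference data
    timestamp_column_name="timestamp",
    metrics=["roc_auc"],
    chunk_period="M",                  # assumption: monthly chunks
    problem_type="classification_binary",
)

# Fit on the reference data, then estimate performance on the analysis data
estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# Plot the estimated performance over time
estimated_performance.plot().show()
```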

In the code above, we initialized our estimator with a variety of required parameters, fitted it to our reference data, and then used it to calculate an estimate of the performance for our analysis (post-deployment) data.

Output:

Performance estimation (Image created by author)

Without having access to the target data, NannyML employs a technique known as CBPE (Confidence-Based Performance Estimation) to calculate the performance of a machine learning model in use. Here is a fantastic explanation of the algorithm in their official documentation if you wish to fully comprehend it.

CBPE basically leverages the model’s probability scores on the test data (denoted by the blue dotted line from Jan 2018 to July 2018) to estimate the performance on the analysis data (the purple shaded region). We can see that there is a noticeable decrease in the model’s performance after deployment, which is why that region is lightly shaded in red. This is the primary indication that our model is not operating as planned, so further investigation is necessary to identify the problems.

Note:

Because we don’t yet know the target data, this performance is only an estimate, and the realized performance may turn out lower than what NannyML predicted using CBPE.

As was previously discussed, model drift is the most frequent reason for an ML model to fail, so let’s estimate the univariate and multivariate drift.

Univariate drift occurs when the distribution of an individual variable changes significantly. The univariate strategy used by NannyML to detect data drift examines each variable separately and contrasts the chunks produced from the analysis period with the reference period. To identify drift, NannyML provides statistical tests in addition to distance measures; these are referred to as methods. Some methods can only be used with categorical data, others only with continuous data, and some can be used on both. NannyML lets us select which methods to apply. Following is the code to estimate the univariate drift using NannyML:
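A sketch of the calculator setup (the column list follows the features shown in the plots below; the chunking choice is an assumption):

```python
# Initialize the univariate drift calculator with the monitored columns
univariate_calculator = nml.UnivariateDriftCalculator(
    column_names=[
        "car_value",
        "loan_length",
        "salary_range",
        "repaid_loan_on_prev_car",
        "size_of_downpayment",
        "y_pred_proba",
    ],
    timestamp_column_name="timestamp",
    continuous_methods=["jensen_shannon"],  # distance measure for continuous columns
    categorical_methods=["chi2"],           # statistical test for categorical columns
    chunk_period="M",                       # assumption: monthly chunks
)
```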

Now let’s fit the "reference" data and estimate the univariate drift for the "analysis" data:
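Roughly, that looks like this:

```python
# Fit on the reference data, then compute drift for the analysis data
univariate_calculator.fit(reference)
univariate_results = univariate_calculator.calculate(analysis)

# Inspect the results as a DataFrame
univariate_results.to_df()
```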

Note:

By using the to_df() method we can transform the results into a DataFrame.

The reference data, which serves as the standard against which the analysis data will be measured, must be given to the fit() method. Then, using the provided data, the calculate() method will calculate the outcomes of the drift.

Note:

Due to space constraints, I won’t be able to display the output here, but you can refer to this for the full results.

The next step is visualizing the results. For a given column, NannyML can plot both the drift and the distribution. The results of the Jensen-Shannon approach for each continuous column will be plotted first, followed by the chi2 results for each categorical column:
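A sketch of the plotting call for the continuous columns (method names as above):

```python
# Drift plots (Jensen-Shannon distance) for the continuous columns
figure = univariate_results.filter(
    column_names=univariate_results.continuous_column_names,
    methods=["jensen_shannon"],
).plot(kind="drift")
figure.show()
```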

Note:

To keep the article short, I won’t display the visualization of every feature, but you may access it using this if you wish to view them all.

Jensen-Shannon distance for car_value (Image created by author)
Jensen-Shannon distance for loan_length (Image created by author)
Jensen-Shannon distance for y_pred_proba (Image created by author)

Chi2 results for the categorical columns:
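And a similar call for the categorical columns (sketch):

```python
# Drift plots (chi-squared statistic) for the categorical columns
figure = univariate_results.filter(
    column_names=univariate_results.categorical_column_names,
    methods=["chi2"],
).plot(kind="drift")
figure.show()
```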

Chi2 statistics for salary_range (Image created by author)
Chi2 statistics for repaid_loan_on_prev_car (Image created by author)
Chi2 statistic for size_of_downpayment (Image created by author)

Distribution of continuous variables

Using NannyML we can also get details about the distributions of continuous and categorical variables. When dealing with continuous variables, NannyML creates a graphic called a joyplot that displays the estimated probability distribution of the variable for each chunk. The portions where drift was found are highlighted:
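A sketch of the distribution plot for the continuous columns:

```python
# Distribution ("joyplot") per chunk for the continuous columns
figure = univariate_results.filter(
    column_names=univariate_results.continuous_column_names,
    methods=["jensen_shannon"],
).plot(kind="distribution")
figure.show()
```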

Distribution over time for car_value (Image created by author)

As we move from the time period of January 2019 to July 2019, we can see that there is a shift in the first quartile.

Distribution over time for loan_length (Image created by author)

As we move from January 2019 to July 2019, we can see in the plot above that the requested loan lengths have also increased.

Distribution over time for y_pred_proba (Image created by author)

In the plot above, the highest predicted probabilities have shifted slightly down, and the lowest have shifted slightly up. This is an apparent sign that the data is moving closer to the decision boundary. This is also known as model output drift.

Distribution of categorical variables

NannyML creates stacked bar charts to display the distribution of categorical variables for each chunk. In order to make the plots easier to examine, if a variable has more than five categories, only the top four are shown. Using the code below, we can create bar charts for the categorical variables in the model:
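A sketch of that call:

```python
# Stacked bar charts per chunk for the categorical columns
figure = univariate_results.filter(
    column_names=univariate_results.categorical_column_names,
    methods=["chi2"],
).plot(kind="distribution")
figure.show()
```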

Distribution over time for salary_range (Image created by author)

As seen in the plot above, the share of the lowest salary range has migrated slightly upward (the area highlighted by the white circle), while the share of the highest salary range has decreased (the area highlighted by the purple circle).

Distribution over time for repaid_loan_on_prev_car (Image created by author)

All of the charts above have one thing in common: each of them includes a time parameter. So it makes sense to ask whether we can perform equivalent analyses without temporal information in our dataset.

The machine learning model monitoring service from NannyML focuses on how things actually evolve over time, which makes it one of its unique features. However, by simply partitioning the data in the reference and analysis sets, NannyML can also be used for non-temporal analysis.

Note: The inclusion of the time factor in the analysis is advised because it can produce more accurate results than non-temporal analysis.

Multivariate drift

Multivariate drift occurs when the relationships between the input features change. Multivariate change detection may be harder to interpret, yet it is necessary to overcome the limitations of univariate change detection.

Why is it necessary to examine multivariate drift?

Multivariate data drift detection addresses the drawbacks of univariate drift detection techniques. It reduces the risk of false alerts, finds subtler changes in the data structure that univariate techniques miss, and delivers a single summary value.

To find such changes, NannyML employs Data Reconstruction with PCA. The technique measures the reconstruction error and returns it as a single value; changes in this value reflect a change in the structure of the model inputs. NannyML calculates the reconstruction error for the monitored model over time, and if the results deviate from a range established by the variance in the reference data period, an alert is raised. Following is the code to estimate multivariate drift using NannyML:
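A sketch (the column list and chunking mirror the univariate setup above):

```python
# Initialize the multivariate drift calculator (PCA-based data reconstruction)
multivariate_calculator = nml.DataReconstructionDriftCalculator(
    column_names=[
        "car_value",
        "loan_length",
        "salary_range",
        "repaid_loan_on_prev_car",
        "size_of_downpayment",
    ],
    timestamp_column_name="timestamp",
    chunk_period="M",  # assumption: monthly chunks
)

# Fit on the reference data, then compute reconstruction error for the analysis data
multivariate_calculator.fit(reference)
multivariate_results = multivariate_calculator.calculate(analysis)
```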

In the above code, we initialized the DataReconstructionDriftCalculator with appropriate parameters. The reference data must then be passed to the fit() method, where the calculator learns the baseline it will compare against. The calculate() method then uses the supplied data to determine the multivariate drift results. Missing data is a crucial issue that needs to be addressed: NannyML’s default imputation assigns the mean value for continuous features and the most common value for categorical features.

Note: Instances of the SimpleImputer class can override these defaults, in which case NannyML will carry out the imputation as directed.
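For illustration only, overriding the defaults might look like the sketch below; the imputer_continuous / imputer_categorical parameter names are an assumption and may differ between NannyML versions:

```python
from sklearn.impute import SimpleImputer

# Hypothetical override of the default imputation strategies
calculator_with_custom_imputers = nml.DataReconstructionDriftCalculator(
    column_names=["car_value", "loan_length", "salary_range"],
    timestamp_column_name="timestamp",
    imputer_continuous=SimpleImputer(strategy="median"),
    imputer_categorical=SimpleImputer(strategy="constant", fill_value="missing"),
)
```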

Let’s visualize our multivariate drift results using the following code:
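For instance:

```python
# Plot the reconstruction error over time, with thresholds and alerts
figure = multivariate_results.plot()
figure.show()
```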

Data Reconstruction Drift (Image created by author)

Following is the analysis of the above plot:

  1. The reference period is shown using a blue step plot. The midpoints of the chunks are indicated by thick square markers on the plot.
  2. The purple step plot depicts the reconstruction error for each interval of the analysis period.
  3. The sampling error can be seen as a low-saturated purple shaded region surrounding the reconstruction error.
  4. The upper and lower alerting thresholds are indicated by the horizontal red dashed lines.
  5. If the reconstruction error exceeds either the upper or lower threshold, a warning is triggered and shown with a red, low-saturated background that spans the entire width of the relevant chunk. A red diamond-shaped marker in the middle of the chunk serves as a further indication of this.

Ranking

NannyML ranks features based on the total number of alerts they have received across all methods. This relates the univariate feature drift back to the estimated performance.
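A sketch of the ranking step (the AlertCountRanker name follows recent NannyML releases and may differ in older versions):

```python
# Rank features by how many univariate drift alerts they triggered
ranker = nml.AlertCountRanker()
ranked_features = ranker.rank(univariate_results)
ranked_features
```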

Ranking (Image created by author)

I hope you learned a few helpful tidbits on how to monitor your machine learning models while they are in use and that you now know why deployment is never the last step. I also hope that you now have sufficient clarity to know where to begin monitoring your models when they are in production!

Thank you for reading till here. Please let me know if you have any questions. I would be happy to help! 🙂

