Data Drift — Part 1: Types of Data Drift

The different types of data drift that cause model performance degradation

Numal Jayawardena
Towards Data Science



In this series of posts, I plan to explain the concept of data drift, how it can cause model performance degradation, how to identify it, and how to come up with a monitoring plan that helps catch data drift and performance degradation early on.

Let’s consider a scenario where you’ve developed a model that predicts the probability of default of clients on their loans, and it performs very well. Let’s say, for simplicity, that the performance metric is accuracy and your model has 86% accuracy. But a few months after deployment, it turns out your model isn’t doing so well: performance has dropped to 70%. Or maybe delays in project approvals and deployment caused the trained model to sit there for months without being put into production, and when it finally was deployed, the performance wasn’t great. So does that mean the model you developed back then wasn’t actually good? Not necessarily; “shift happens” (get it?). In this series of posts, I’ll talk about how data drift (or data shift) might be behind the degrading performance, the different types of data drift, how to identify it, overall model monitoring practices so that these changes don’t catch you off guard, and finally how to fix your model after identifying data drift.

In this post (Part 1), let’s take a look at the different types of data drift and some examples of how they can occur. This will help us understand how data drift can cause model performance degradation.

Data shift, data drift, concept shift, changing environments, and data fractures are all similar terms that describe the same phenomenon: a difference in the distribution of data between training and test/production data.¹

When these changes occur, we break a fundamental assumption of machine learning models: that the past (training) data is representative of the future (test/production) data.

Types of Data Shift

1) Covariate Shift (Shift in the independent variables):

Covariate shift is a change in the distribution of one or more of the independent variables (input features). This means that, due to some change in the environment, the distribution of feature X has changed even though the relationship between feature X and target Y remains the same. The graph below may help make this clearer.

Image by Author

In our Probability of Default example from above, this could mean that due to the pandemic many businesses closed, saw their revenues decrease, had to reduce staff, and so on, but they decided to keep paying their loans because they were afraid that the bank might seize their assets (different distributions for the X variables but the same distribution of Y).

Another example could be a model you put into production that starts to see a new age demographic as its user base grows. You might have trained on younger users, but over time you might gain a lot of older users as well. You would then see an increase in the mean and variance of the age feature, and therefore data drift.

Performance degradation will be more apparent when this sort of shift happens in one or more of the top contributing variables of a model. You should verify that the input features are stable (i.e., check for this sort of shift within and between the train and test sets) during model development as well, and then continue to do so with model monitoring after deployment. To learn how to identify covariate shift, please see Part 2 of this series and my blog post.

More formally, covariate shift is the situation where P_train(Y|X) = P_test(Y|X) but P_train(X) ≠ P_test(X),

where P_test could be your test set or data observed after the model has been deployed.
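To make this concrete, here is a minimal sketch of what detecting covariate shift in a single feature could look like, using a two-sample Kolmogorov-Smirnov test from scipy. The data is simulated, and the variable names (age_train, age_prod) and the 0.01 threshold are illustrative assumptions, not recommendations:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Simulate covariate shift: the age feature drifts upward in production
# while the relationship between age and the target stays the same.
age_train = rng.normal(loc=35, scale=5, size=5_000)   # ages at training time
age_prod = rng.normal(loc=45, scale=8, size=5_000)    # ages after deployment

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the two
# samples come from different distributions, i.e. P(X) has shifted.
stat, p_value = ks_2samp(age_train, age_prod)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
if p_value < 0.01:
    print("Covariate shift detected in the 'age' feature")
```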

2) Prior Probability Shift (Shift in the target variable):

With prior probability shift, the distribution of the input variables remains the same but the distribution of the target variable changes. For example, it could look something like this:

Image by Author

In our Probability of Default example from above, there could be some companies that were not really affected by the lockdown and did not suffer any revenue losses, but that deliberately chose not to repay their loan installments in order to take advantage of government subsidies, perhaps saving that money in case the situation worsens for them in the future (same X distribution but different Y).

More formally, prior probability shift is the situation where P_train(X|Y) = P_test(X|Y) but P_train(Y) ≠ P_test(Y),

where P_test could be your test set or data observed after the model has been deployed.
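As a quick illustration, here is one way you could check for a shift in the label distribution using a chi-square test on the label counts. The counts below are made up for the default example, and the 0.01 threshold is again only illustrative:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical default labels: 0 = repaid, 1 = defaulted.
# The input mix is assumed unchanged; only the label balance moves.
y_train = np.array([0] * 9_200 + [1] * 800)    # ~8% defaults at training time
y_prod = np.array([0] * 8_300 + [1] * 1_700)   # ~17% defaults in production

# Contingency table of label counts in each period.
table = [
    [int((y_train == 0).sum()), int((y_train == 1).sum())],
    [int((y_prod == 0).sum()), int((y_prod == 1).sum())],
]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p-value = {p_value:.3g}")
if p_value < 0.01:
    print("Prior probability shift detected: P(Y) has changed")
```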

3) Concept Shift

With concept drift, the relationship between the input and output variables changes. The distributions of the input variables (such as user demographics, frequency of words, etc.) might even remain the same; instead, we must focus on the change in the relationship between X and Y.

More formally, concept shift is the situation where P_train(Y|X) ≠ P_test(Y|X),

where P_test could be your test set or data observed after the model has been deployed.
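The sketch below simulates concept shift on toy data: the inputs keep the same distribution, but the rule linking X to Y flips after deployment, so a model trained before the shift collapses. It assumes scikit-learn is available, and all names and numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# The input distribution is identical before and after the shift...
X_train = rng.normal(size=(5_000, 1))
X_prod = rng.normal(size=(5_000, 1))

# ...but the rule linking X to Y flips sign after deployment.
y_train = (X_train[:, 0] > 0).astype(int)   # old world: Y = 1 when X > 0
y_prod = (X_prod[:, 0] < 0).astype(int)     # new world: Y = 1 when X < 0

model = LogisticRegression().fit(X_train, y_train)
print(f"Accuracy before the shift: {model.score(X_train, y_train):.2f}")  # ~1.00
print(f"Accuracy after the shift:  {model.score(X_prod, y_prod):.2f}")    # ~0.00
```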

Concept drift is more likely to appear in domains that depend on time, such as time series forecasting and data with seasonality. A model learned on one month’s data won’t necessarily generalize to another month. There are a few different ways in which concept drift might show up.

Gradual Concept Drift

Gradual or incremental drift is concept drift that we can observe happening over time and can therefore expect. As the world changes, our model gradually becomes outdated, resulting in a gradual decline in its performance.

Image by Author

Some examples of gradual concept drift are²:

  • Launch of alternative products — products that weren’t available during the training period (for example, if the product used to be the only one of its kind on the market) can have unforeseen effects on the model, since it has never seen similar trends before
  • Economic changes — changes in interest rates may, for example, lead more borrowers to default on their loans

The effects of situations like these can add up over time, causing a more dramatic drift.

Sudden Concept Drift

Image by Author

As the name suggests, these concept shifts happen suddenly and by surprise. Some of the most apparent examples came when COVID-19 first struck on a global scale: demand forecasting models were heavily affected, supply chains couldn’t keep up, and suddenly there were things like the great toilet paper shortage of 2020 (not an official term, just something I call it 😆).

Image by Author

But such changes can also happen during the regular operation of a company, without a pandemic happening².

Major change to a road network: the sudden opening of new roads and closing of others, or the addition of a new public railway system, may cause trouble for a traffic prediction model until it has collected some data to work with, since it has never seen a similar configuration before.

New equipment added to a production line: new equipment introduces new problems and removes old ones, so the model will be unsure of how to make good predictions.

In general, any major change in the environment that throws the model into unfamiliar territory will cause performance degradation.

Recurring Concept Drift

Image by Author

Recurring concept drift is pretty much “seasonality”. But seasonality is common in machine learning with time series data and is something we are aware of. So if we expect this sort of drift, for example a different pattern on weekends or on certain holidays of the year, we just need to make sure that we train the model on data that represents this seasonality. This sort of data drift usually becomes a problem in production only if a new pattern develops that the model is unfamiliar with.
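As a rough sketch, encoding the recurring patterns as explicit features (and training on at least one full cycle of data) is one common way to handle this. The column names below are hypothetical:

```python
import pandas as pd

# Hypothetical daily data with a timestamp column named 'date'.
df = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=365, freq="D")})

# Encode the recurring patterns explicitly so the model can learn them,
# provided the training window covers at least one full seasonal cycle.
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
```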

In practice, identifying the exact type of data drift is less important; often the drift is subtle and a combination of these types. What matters is identifying the impact on model performance and catching the drift in time, so that actions such as retraining the model can be taken early on.
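For instance, a simple way to watch for this impact is to track a performance metric over a rolling window of recent, labelled predictions and alert when it falls well below the level seen at training time. This is only a minimal sketch of that idea; the window size and thresholds are illustrative:

```python
import numpy as np

def rolling_accuracy(y_true, y_pred, window=500):
    """Accuracy over a sliding window of the most recent predictions."""
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    kernel = np.ones(window) / window
    # Each output value is the mean accuracy over one window of predictions.
    return np.convolve(correct, kernel, mode="valid")

# Hypothetical usage: alert when recent accuracy drops well below the
# 86% seen at training time. The 5-point tolerance is illustrative.
# acc = rolling_accuracy(labels_so_far, preds_so_far)
# if acc[-1] < 0.86 - 0.05:
#     print("Possible drift: investigate and consider retraining")
```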

Now that you suspect data drift is going on, how do you identify where the drift behind your model’s performance degradation is occurring? To learn how to identify data drift in your datasets, please read Part 2 of this series.

Check out my blog at https://practicalml.net for more upcoming posts


Software Engineer turned Data Scientist. Interested in solving business problems through machine learning. Find me at https://www.linkedin.com/in/numalj