
Prediction models built with machine learning algorithms often underperform in the real world.
Even AI systems built by some of the world’s leading experts struggle to replicate their promising laboratory performance out in the field. A prominent example is the health care AI developed at Google.
These applications aim to assist in diagnostics, from cancer screening to disease detection and risk profiling.
They performed very well when trained and assessed in a laboratory setting. But when tested in the real world, for example in clinics in Thailand, their performance couldn’t match what had been achieved in the lab.
In that case, the data fed to the algorithm was of lower quality than the training data. But many different underlying causes lead to the same problem: data shifts.
A data shift is a change in the distribution of the data. Such a change can happen for many reasons, but these can be boiled down to three primary mechanisms:
- Causes leading to changes in the target variable (that aren’t linked to the explanatory variables used).
- Causes leading to changes in explanatory variables.
- Causes that change the underlying relationships and/or patterns between the target variable and the explanatory variables.
Each of these mechanisms is further elaborated below.
Understanding the roots and implications of data shifts is a prerequisite to solving issues related to models performing worse than expected (model performance degradation).
Zoning In
Imagine this.
You want to implement a prediction model. The process to do so is clear.
First, a thorough (if tedious) exploration of the data available for training and validation. Then, the development of a range of machine learning models.
And finally, the proper selection and implementation of a model that performs well across various validation procedures.
What could go wrong?
Well, a lot of things. One of the most common issues for AI and ML models is a shift in the underlying distribution of the data. Whether you use fairly straightforward regression models or complex algorithms such as deep neural networks, data shifts are a common cause of headaches among modelers.
1. Prior Probability Shift
The first mechanism listed above is formally known as a prior probability shift.
It describes a situation where the distribution of the target variable changes while the distributions of the explanatory variables (or input variables) don’t.
This can be due to a change of state. Say you want to predict how much of a given prescription drug pharmacies will sell, in order to ensure efficient supply management.
You have a model that takes into account the number of people living nearby, the incidence of the disease treated by that drug, the price of the drug and alternative drugs, and the distance between pharmacies.
But then something unforeseen happens that changes demand for the drug, even though none of the variables that explain demand have changed.
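To make the setup concrete, here is a minimal sketch of what such a demand model could look like. The file name, column names, and the choice of a plain linear regression are purely illustrative assumptions, not a prescription.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical feature set mirroring the variables mentioned above.
features = [
    "population_nearby",          # people living close to the pharmacy
    "disease_incidence",          # incidence of the condition the drug treats
    "drug_price",                 # price of the drug
    "alternative_drug_price",     # price of substitute drugs
    "distance_to_next_pharmacy",  # distance to the nearest competing pharmacy
]

# Hypothetical historical sales data; file name and columns are assumptions.
df = pd.read_csv("pharmacy_sales.csv")

X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["units_sold"], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print(f"Validation R^2: {model.score(X_val, y_val):.3f}")
```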
A quick smell test – to check whether further resources should be spent on detecting a potential prior probability shift – is simply to plot histograms of the target variable for 1) the dataset used for training and validation and 2) the new dataset on which the model is underperforming.
If the two histograms clearly differ, the smell test suggests a shift.
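A rough sketch of that smell test, reusing the hypothetical pharmacy data from above (file and column names are again assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Data used to train/validate the model vs. data from the live period
# in which the model underperforms (file names are assumptions).
train_df = pd.read_csv("pharmacy_sales.csv")
new_df = pd.read_csv("pharmacy_sales_recent.csv")

# Compare the distribution of the target variable in the two datasets.
fig, ax = plt.subplots()
ax.hist(train_df["units_sold"], bins=30, alpha=0.5, density=True, label="training data")
ax.hist(new_df["units_sold"], bins=30, alpha=0.5, density=True, label="new data")
ax.set_xlabel("units sold")
ax.set_ylabel("density")
ax.set_title("Target distribution: training vs. new data")
ax.legend()
plt.show()
```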

2. Covariate Shift
The second mechanism listed, a change in the distribution of the explanatory variables (also called covariates or input variables), is known as a covariate shift.
In the example above, a covariate shift would occur if the population living near the pharmacies increased sharply, but (for some reason) none of the newcomers had any condition creating demand for the drug.
A way to test for covariate shifts is to mix the new data (from the period in which the model is deployed but performs badly) with the original training data, labelling each observation by its origin.
You then take out random samples as test data and use the remaining data to train a model to predict whether an observation stems from the original training data or from the new data.
If the original training data and the new data are indistinguishable, it’s unlikely you have a covariate shift.
If they are distinguishable, you should retrain your model.
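A sketch of that check is below. It uses cross-validation instead of a single manual train/test split, and a random forest as the discriminator; both choices, along with the file and column names, are assumptions rather than a prescribed recipe.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

features = [
    "population_nearby", "disease_incidence", "drug_price",
    "alternative_drug_price", "distance_to_next_pharmacy",
]

train_df = pd.read_csv("pharmacy_sales.csv")       # original training data
new_df = pd.read_csv("pharmacy_sales_recent.csv")  # data from the live period

# Label each row by its origin and ask a classifier to tell them apart.
X = pd.concat([train_df[features], new_df[features]], ignore_index=True)
y = np.array([0] * len(train_df) + [1] * len(new_df))

clf = RandomForestClassifier(n_estimators=200, random_state=42)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# AUC near 0.5: the datasets look alike -> a covariate shift is unlikely.
# AUC well above 0.5: they are distinguishable -> a shift is likely.
print(f"Discriminator AUC: {auc:.3f}")
```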
3. Concept Drift
The third mechanism listed, changes in the relationship between the target variable and the explanatory variables, is often referred to as concept drift.
It can happen for several reasons, among them selection (a deterministic removal of observations), undetected cyclical variation, and non-stationarity.
Concept drift is common in time series. How to deal with it depends on the concrete issue and is sometimes constrained by the available data.
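One simple way to look for concept drift in a time series setting is to refit the same model on successive time windows and watch whether the estimated relationship wanders over time. The sketch below assumes a timestamped version of the hypothetical pharmacy data; the date column, window length, and minimum sample size are all assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

features = [
    "population_nearby", "disease_incidence", "drug_price",
    "alternative_drug_price", "distance_to_next_pharmacy",
]

# Timestamped sales data; the "date" column is an assumption.
df = pd.read_csv("pharmacy_sales.csv", parse_dates=["date"]).sort_values("date")

# Refit the same model on successive 90-day windows and watch the coefficients.
for period, chunk in df.set_index("date").resample("90D"):
    if len(chunk) < 50:  # skip windows with too few observations
        continue
    model = LinearRegression().fit(chunk[features], chunk["units_sold"])
    print(period.date(), dict(zip(features, model.coef_.round(3))))
```

If the coefficients move noticeably from window to window, the relationship the model relies on is itself changing, which is exactly what concept drift means.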
Takeaways
Data shifts are a common issue across applications of machine learning, from image recognition to predictive modeling.
How difficult they are to detect and solve differs from case to case.
But thinking about them, testing for them, and either dealing with them or being aware of how they challenge the proper functioning of a given model or algorithm is crucial.
So dear data scientists, beware of data shifts.