
The Machine Learning models we build are based on a simple premise: the data we see at inference time should follow the same distribution as the data used for training.
This premise is simple but rather strong because it means we can infer new observations based on previously known ones.
Sometimes, however, some real-life effects can cause the observations to shift. One example is the Covid-19 pandemic. A financial model trained before the spread of the virus might not work well for the population after the pandemic. Families might have had reduced income, and even those that didn’t probably gave a second thought before spending their money.
These effects can be silent, and to avoid them we need to evaluate the stability of our models: in other words, whether the model is still effective after real-world conditions change.
In this article, I will show you two metrics that can do that, the Population Stability Index (PSI) and the Characteristic Stability Index (CSI).
Model Stability
In dynamic systems analysis, we can define a stable system as one that remains unchanged (or only slightly changed) in the presence of perturbations. Simply put, a stable system is robust to external changes.
One way to measure the stability of our models is by checking for population or data drift, that is, by evaluating how the population or the features have changed in the context of the model.
There are several possible sources of population drift. Some examples include:
- Changes in socio-economic conditions, such as inflation, disease outbreaks, or political shifts;
- Unaccounted events, such as holidays, world cups, or even natural disasters;
- The entrance of a new competitor into the market, and/or a shift in the customer base;
- Changes in the offered product, or the marketing campaign.
One less-discussed source of data and population drift is the use of the model itself. If you develop a model to solve a business problem and the solution is effective, the circumstances change, and the model might no longer perform the same!
In the following sections, I will show how to calculate PSI and CSI in the context of the stability of machine learning models. All the code and examples are available on my GitHub repository:
Population Stability Index (PSI)
The Population Stability Index (PSI), as the name suggests, is a measurement of how much the population has changed in two different moments.
Our objective is to see how the population distribution has changed in terms of the model's predicted (dependent) variable. For a regression model, we can use the predicted value directly; for a binary classification model, we use the predicted probabilities (.predict_proba() in scikit-learn).
First, we create bins of our test data by slicing the range of the predicted variable into sections of the same size. The number of bins is arbitrary, but it is common to use 10.
Then we compare the percentage of the population inside each bin with the "new" production data. Plotting these bins, we can compare them visually:

It is clear that the new population has drifted: the initial distribution was narrower, with a mean close to 4, while the new distribution has a mean close to 6 and is more spread out.
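This binning-and-comparison step can be sketched in code. The two normal samples below are hypothetical stand-ins for the initial and new score distributions, not the article's actual data:

```python
import numpy as np

rng = np.random.default_rng(42)
# hypothetical scores: initial (training) and new (production) populations
initial_scores = rng.normal(4, 1, 1000)
new_scores = rng.normal(6, 2, 1000)

# 10 equal-width bins; here they span the combined range so no observation is dropped
edges = np.linspace(
    min(initial_scores.min(), new_scores.min()),
    max(initial_scores.max(), new_scores.max()),
    11,
)
expected = np.histogram(initial_scores, edges)[0] / len(initial_scores)  # initial shares
actual = np.histogram(new_scores, edges)[0] / len(new_scores)            # new shares
```

Each array holds the fraction of its population per bin, which is exactly what a bar plot like the one above compares.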
Now we calculate the PSI of each bin using the equation below:

PSI_i = (Actual%_i − Expected%_i) × ln(Actual%_i / Expected%_i)

where Expected%_i is the share of the initial population that falls in bin i, and Actual%_i is the share of the new population in the same bin.
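Applied in code, the per-bin calculation is a one-liner. The bin shares below are made up purely to illustrate the formula:

```python
import numpy as np

# hypothetical population shares per bin (each array sums to 1)
expected = np.array([0.02, 0.08, 0.20, 0.30, 0.20, 0.10, 0.05, 0.03, 0.01, 0.01])
actual   = np.array([0.01, 0.04, 0.10, 0.20, 0.25, 0.20, 0.12, 0.05, 0.02, 0.01])

# per-bin PSI: both factors always share the same sign, so each term is >= 0
psi_per_bin = (actual - expected) * np.log(actual / expected)
```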
The results are shown in the following table:

If we wish, we can take the average PSI as a single metric for the whole model. In this case, the model’s PSI is 0.1435.
Conventionally, we consider:
- PSI < 0.1: the population hasn't changed, and we can keep the model;
- 0.1 ≤ PSI < 0.2: the population has changed slightly, and it is advisable to evaluate the impacts of these changes;
- PSI ≥ 0.2: the changes in the population are significant, and the model should be retrained or even redesigned.
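These conventional thresholds can be encoded in a tiny helper (the zone labels here are my own naming, not a standard):

```python
def psi_zone(psi_value: float) -> str:
    """Classify a PSI value using the conventional 0.1 / 0.2 cut-offs."""
    if psi_value < 0.1:
        return "stable"            # keep the model
    if psi_value < 0.2:
        return "warning"           # evaluate the impact of the change
    return "significant drift"     # retrain or even redesign the model
```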
In our example, the change was significant, but not so much that it requires immediate action. We should still dig deeper into its causes, and we can do that with CSI.
But first, remember that we created our bins with fixed-size ranges? Another way to calculate PSI is by using quantile bins. In this scenario, using 10 bins ensures that each bin holds 10% of the initial population, which we then compare with the new population:
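A sketch of this quantile-binning step, again with hypothetical samples: the breakpoints come from the deciles of the initial population, so each initial bin holds roughly 10% of it by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
initial_scores = rng.normal(4, 1, 1000)   # hypothetical training-time scores
new_scores = rng.normal(6, 2, 1000)       # hypothetical production scores

# breakpoints at the deciles of the initial population
breakpoints = np.quantile(initial_scores, np.linspace(0, 1, 11))
# open the outer edges so new values outside the initial range are still counted
breakpoints[0], breakpoints[-1] = -np.inf, np.inf

expected = np.histogram(initial_scores, breakpoints)[0] / len(initial_scores)
actual = np.histogram(new_scores, breakpoints)[0] / len(new_scores)
```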

The process here is pretty much the same, but the results can be slightly different:

The average PSI in this scenario is 0.1087, which is still in the "warning" zone, but much closer to the 0.1 threshold of the "safe" zone.
If you have the time and resources, it is advisable to calculate PSI both ways.
If you are looking for the Python code to calculate PSI, please check out my GitHub, and the function below:
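Since the function itself isn't reproduced here, below is a minimal sketch of what such a psi function can look like, consistent with the steps described above (the exact signature and the small floor used to avoid empty bins are my own choices):

```python
import numpy as np

def psi(initial, new, mode='fixed', bins=10):
    """Return the PSI value of each bin between two samples (a sketch)."""
    if mode == 'fixed':
        # equal-width bins over the range of the initial sample
        breakpoints = np.linspace(np.min(initial), np.max(initial), bins + 1)
    elif mode == 'quantile':
        # bins holding (roughly) equal shares of the initial sample
        breakpoints = np.quantile(initial, np.linspace(0, 1, bins + 1))
    else:
        raise ValueError("mode must be 'fixed' or 'quantile'")
    # open the outer edges so new values outside the initial range are counted
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    expected = np.histogram(initial, breakpoints)[0] / len(initial)
    actual = np.histogram(new, breakpoints)[0] / len(new)
    # small floor avoids log(0) and division by zero in empty bins
    expected = np.clip(expected, 1e-6, None)
    actual = np.clip(actual, 1e-6, None)
    return (actual - expected) * np.log(actual / expected)
```

The function returns the per-bin values, which can then be averaged (as done above) or summed, depending on the convention you follow.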
Characteristic Stability Index (CSI)
In the previous example, our model's PSI was in the "warning" zone, between 0.1 and 0.2. We now need to understand which features may have caused the drift. Enter CSI.
The Characteristic Stability Index (CSI) is used to evaluate the stability or drift of each feature, so that we can find the problematic ones. While PSI is concerned with the effects of population drift on the model's predictions, CSI is concerned with how the feature distributions themselves have changed.
Using it is really simple: we just apply the same formula we used for PSI, but instead of binning the data by using the predicted variable, we use each feature to create the bins:
# Fixed-size bins
print("CSI - Fixed-size bins")
for col in sample_initial.columns:
    csi_values = psi(sample_initial[col].values, sample_new[col].values, mode='fixed')
    csi = np.mean(csi_values)
    print(f'{col} -> {csi=:.4f}')

# Quantile bins
print("\nCSI - Quantile bins")
for col in sample_initial.columns:
    csi_values = psi(sample_initial[col].values, sample_new[col].values, mode='quantile')
    csi = np.mean(csi_values)
    print(f'{col} -> {csi=:.4f}')
In this example the results are:

So x1 has barely changed, x2 has changed slightly but is still in the "safe" zone, while x3 has changed significantly and is probably the feature that caused the population drift.
Conclusion
Model Stability is not often covered in Data Science courses, but it is really important when dealing with production models. I hope this article has helped you understand it and apply it to your own pipelines.
Also, please refer to my GitHub repository to see one regression and one classification example: