
In most binary classification problems we use the ROC Curve and the ROC AUC score to measure how well the model separates the predictions of the two classes. I explain this mechanism in another article, but the intuition is simple: if the model gives lower probability scores to the negative class and higher scores to the positive class, we can say it is a good model.
Now here’s the catch: we can also use the KS-2samp test to do that!
Kolmogorov-Smirnov (KS) test
The KS statistic for two samples is simply the largest distance between their two empirical CDFs (cumulative distribution functions), so if we measure the distance between the positive and negative class score distributions, we get another metric to evaluate classifiers.
There is a benefit to this approach: the ROC AUC score goes from 0.5 to 1.0, while the KS statistic ranges from 0.0 to 1.0. For business teams, it is not intuitive that 0.5 is a bad ROC AUC score while 0.75 is "only" a medium one. There is also a pre-print paper [1] that claims the KS statistic is simpler to calculate.
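As a quick illustration of the statistic itself, here is a minimal sketch using scipy.stats.ks_2samp on two synthetic score samples (the distributions and sample sizes here are arbitrary choices, not the article's data):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Two synthetic "score" samples: the further apart they are,
# the larger the maximum distance between their empirical CDFs
scores_negative = rng.normal(loc=0.3, scale=0.1, size=500)
scores_positive = rng.normal(loc=0.7, scale=0.1, size=500)

result = ks_2samp(scores_positive, scores_negative)
print(f"KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.3e}")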
If you wish to understand better how the KS test works, check out my article about this subject:
Comparing sample distributions with the Kolmogorov-Smirnov (KS) test
Experiment
Let’s get to work.
All the code is available on my GitHub, so I'll only go through the most important parts.
As an example, we can build three datasets with different levels of separation between classes (see the code to understand how they were built).

On the "good" dataset, the classes don't overlap and there is a clear gap between them. On the "medium" one there is enough overlap to confuse the classifier. The overlap is so intense on the "bad" dataset that the classes are almost inseparable.
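The exact construction is in the repository; a sketch in the same spirit, using scikit-learn's make_classification with a decreasing class_sep (the parameter values below are illustrative, not necessarily the ones used in the original code):

from sklearn.datasets import make_classification

def build_dataset(class_sep):
    # Two informative features, two balanced classes (~500 examples each),
    # with the separation between classes controlled by class_sep
    return make_classification(n_samples=1000, n_features=2,
                               n_informative=2, n_redundant=0,
                               class_sep=class_sep, random_state=42)

X_good, y_good = build_dataset(class_sep=5.0)      # wide gap, no overlap
X_medium, y_medium = build_dataset(class_sep=1.0)  # some overlap
X_bad, y_bad = build_dataset(class_sep=0.3)        # heavy overlap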
I trained a default Naïve Bayes classifier for each dataset. We can see the distributions of the predictions for each class by plotting histograms. On the x-axis we have the probability of an observation being classified as "positive" and on the y-axis the count of observations in each bin of the histogram:

The "good" example (left) has a perfect separation, as expected. The "medium" one (center) has a bit of an overlap, but most of the examples could be correctly classified. The classifier could not separate the "bad" example (right), though.
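A minimal sketch of this step, assuming the X_*, y_* arrays from the previous snippet and taking the "default Naïve Bayes classifier" to be scikit-learn's GaussianNB (for brevity, the model is scored on the same data it was trained on):

import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

def fit_and_score(X, y):
    # Fit a default Gaussian Naive Bayes model and return the predicted
    # probability of the positive class for every observation
    model = GaussianNB().fit(X, y)
    return model.predict_proba(X)[:, 1]

y_proba_good = fit_and_score(X_good, y_good)
y_proba_medium = fit_and_score(X_medium, y_medium)
y_proba_bad = fit_and_score(X_bad, y_bad)

# Histogram of the predicted probabilities, split by true class
plt.hist(y_proba_medium[y_medium == 0], bins=20, alpha=0.5, label="negative")
plt.hist(y_proba_medium[y_medium == 1], bins=20, alpha=0.5, label="positive")
plt.xlabel("Predicted probability of the positive class")
plt.ylabel("Count")
plt.legend()
plt.show()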
We can now evaluate the KS and ROC AUC for each case:
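The full helper is on the repository; a minimal sketch of what evaluate_ks_and_roc_auc could look like, assuming y_true holds the labels and y_proba the predicted probability of the positive class:

from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def evaluate_ks_and_roc_auc(y_true, y_proba):
    # Split the predicted scores by the true class
    scores_positive = y_proba[y_true == 1]
    scores_negative = y_proba[y_true == 0]

    # KS: maximum distance between the two empirical CDFs of the scores
    ks = ks_2samp(scores_positive, scores_negative)

    # ROC AUC computed from the same labels and scores
    roc_auc = roc_auc_score(y_true, y_proba)

    print(f"KS: {ks.statistic:.4f} (p-value: {ks.pvalue:.3e})")
    print(f"ROC AUC: {roc_auc:.4f}")
    return ks.statistic, roc_auc

print("Good classifier:")
evaluate_ks_and_roc_auc(y_good, y_proba_good)
print("Medium classifier:")
evaluate_ks_and_roc_auc(y_medium, y_proba_medium)
print("Bad classifier:")
evaluate_ks_and_roc_auc(y_bad, y_proba_bad)

(The numbers below come from the article's original datasets, so a run of this sketch will not reproduce them exactly.)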
And the output is:
Good classifier:
KS: 1.0000 (p-value: 7.400e-300)
ROC AUC: 1.0000
Medium classifier:
KS: 0.6780 (p-value: 1.173e-109)
ROC AUC: 0.9080
Bad classifier:
KS: 0.1260 (p-value: 7.045e-04)
ROC AUC: 0.5770
The good (or should I say perfect) classifier got a perfect score in both metrics.
The medium one got a ROC AUC of 0.908, which sounds almost perfect, but its KS score was 0.678, which better reflects the fact that the classes are not "almost perfectly" separable.
Finally, the bad classifier got an AUC score of 0.577, which is bad (for us data lovers who know that 0.5 means no better than random guessing) but doesn't sound as bad as the KS score of 0.126.
We can also check the CDFs for each case:

As expected, the bad classifier shows only a small distance between the CDFs of classes 0 and 1, since the two score distributions are almost identical. The medium classifier has a larger gap between the class CDFs, so its KS statistic is also larger. Lastly, the score distributions of the "perfect" classifier do not overlap at all, so the distance between the CDFs reaches its maximum and KS = 1.
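For reference, a short sketch of how these empirical CDFs can be drawn from the predicted scores (reusing the y_medium/y_proba_medium arrays from the snippets above):

import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(scores, label):
    # Empirical CDF: sorted scores vs. cumulative fraction of observations
    x = np.sort(scores)
    y = np.arange(1, len(x) + 1) / len(x)
    plt.plot(x, y, label=label)

plot_cdf(y_proba_medium[y_medium == 0], label="class 0")
plot_cdf(y_proba_medium[y_medium == 1], label="class 1")
plt.xlabel("Predicted probability of the positive class")
plt.ylabel("Cumulative fraction")
plt.legend()
plt.show()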
Effect of class imbalance
And how does class imbalance affect the KS score?
To test this we can generate three datasets based on the "medium" one:
- The original, where the positive class has 100% of the original examples (500)
- A dataset where the positive class has 50% of the original examples (250)
- A dataset where the positive class has only 10% of the original examples (50)
In all three cases, the negative class remains unchanged with all 500 examples. After training the classifiers we can see their histograms, as before:

The negative class is basically the same, while the positive one only changes in scale.
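If you want to reproduce this setup, here is a hedged sketch of how the subsampled datasets and their predictions could be built (the actual construction is in the repository; X_medium, y_medium and fit_and_score come from the earlier snippets):

import numpy as np

def subsample_positive(X, y, fraction, seed=42):
    # Keep every negative example and only `fraction` of the positives
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    keep_pos = rng.choice(pos_idx, size=int(len(pos_idx) * fraction),
                          replace=False)
    idx = np.concatenate([neg_idx, keep_pos])
    return X[idx], y[idx]

X_50, y_50 = subsample_positive(X_medium, y_medium, fraction=0.5)
X_10, y_10 = subsample_positive(X_medium, y_medium, fraction=0.1)

# Retrain and score each unbalanced dataset, as before
y_proba_50 = fit_and_score(X_50, y_50)
y_proba_10 = fit_and_score(X_10, y_10)
y_100, y_proba_100 = y_medium, y_proba_medium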
We can use the same function to calculate the KS and ROC AUC scores:
print("Balanced data:")
ks_100, auc_100 = evaluate_ks_and_roc_auc(y_100, y_proba_100)
print("Positive class with 50% of the data:")
ks_50, auc_50 = evaluate_ks_and_roc_auc(y_50, y_proba_50)
print("Positive class with 10% of the data:")
ks_10, auc_10 = evaluate_ks_and_roc_auc(y_10, y_proba_10)
The output is:
Balanced data:
KS: 0.6780 (p-value: 1.173e-109)
ROC AUC: 0.9080
Positive class with 50% of the data:
KS: 0.6880 (p-value: 3.087e-79)
ROC AUC: 0.9104
Positive class with 10% of the data:
KS: 0.6280 (p-value: 1.068e-17)
ROC AUC: 0.8837
Even though in the worst case the positive class had 90% fewer examples, the KS score was only 7.37% lower than on the original dataset (0.628 vs. 0.678). Both ROC AUC and KS are robust to class imbalance.
Multiclass classification evaluation
As with the ROC Curve and ROC AUC, we cannot calculate KS for a multiclass problem without first reducing it to binary classification problems. We can do that by using the "OvO" (one-vs-one) and "OvR" (one-vs-rest) strategies.
You can find the code snippets for this on my GitHub repository for this article, but you can also use my article on Multiclass ROC Curve and ROC AUC as a reference:
Multiclass classification evaluation with ROC Curves and ROC AUC
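For illustration only, here is one possible OvR sketch that averages the per-class KS statistics (the repository may implement this differently):

import numpy as np
from scipy.stats import ks_2samp

def ks_ovr(y_true, y_proba):
    # y_proba: array of shape (n_samples, n_classes) with the predicted
    # probability of each class; every class is treated as "positive"
    # against all the others, and the per-class KS values are averaged
    ks_per_class = []
    for k in range(y_proba.shape[1]):
        scores_pos = y_proba[y_true == k, k]
        scores_rest = y_proba[y_true != k, k]
        ks_per_class.append(ks_2samp(scores_pos, scores_rest).statistic)
    return np.mean(ks_per_class)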
Conclusion
The KS and ROC AUC techniques measure the same thing, how well the model separates the two classes, but in different ways. Even though ROC AUC is the most widespread metric for class separation, it is always useful to know both.
I only understood why I needed to use KS when I started working in a place that used it. It is more a matter of preference, really, so stick with what makes you comfortable.
References
[1] Adeodato, P. J. L., Melo, S. M. On the equivalence between Kolmogorov-Smirnov and ROC curve metrics for binary classification.
[2] SciPy API Reference. scipy.stats.ks_2samp.
[3] SciPy API Reference. scipy.stats.ks_1samp.
[4] SciPy API Reference. scipy.stats.kstwo.
[5] Trevisan, V. Interpreting ROC Curve and ROC AUC for Classification Evaluation.