How You Should Validate Machine Learning Models

Learn to build trust in your machine learning solutions

Patryk Miziuła, PhD
Towards Data Science

--

Large language models have already transformed the data science industry in a major way. One of the biggest advantages is the fact that for most applications, they can be used as is — we don’t have to train them ourselves. This requires us to reexamine some of the common assumptions about the whole machine learning process — many practitioners consider validation to be “part of the training”, which would suggest that it is no longer needed. We hope that the reader shuddered slightly at the suggestion of validation being obsolete — it most certainly is not.

Here, we examine the very idea of model validation and testing. If you believe yourself to be perfectly fluent in the foundations of machine learning, you can skip this article. Otherwise, strap in — we’ve got some far-fetched scenarios for you to suspend your disbelief on.

This article is a joint work of Patryk Miziuła, PhD and Jan Kanty Milczek.

Learning on a desert island

Imagine that you want to teach someone to recognize the languages of tweets on Twitter. So you take him to a desert island, give him 100 tweets in 10 languages, tell him what language each tweet is in, and leave him alone for a couple of days. After that, you return to the island to check whether he has indeed learned to recognize languages. But how can you examine it?

Your first thought may be to ask him about the languages of the tweets he got. So you challenge him this way and he answers correctly for all 100 tweets. Does it really mean he is able to recognize languages in general? Possibly, but maybe he just memorized these 100 tweets! And you have no way of knowing which scenario is true!

Here you didn’t check what you wanted to check. Based on such an examination, you simply can’t know whether you can rely on his tweet language recognition skills in a life-or-death situation (those tend to happen when desert islands are involved).

What should we do instead? How can we make sure he actually learned, rather than simply memorized? Give him another 50 tweets and have him tell you their languages! If he gets them right, he is indeed able to recognize the languages. But if he fails entirely, you know he simply learned the first 100 tweets off by heart — which wasn’t the point of the whole thing.

But how is all this stuff related to machine learning models?

The story above figuratively describes how machine learning models learn and how we should check their quality:

  • The man in the tale stands for a machine learning model. To cut a human off from the world, you need to take him to a desert island. For a machine learning model it’s easier: it’s just a computer program, so the only world it knows is the data we show it.
  • Recognizing the language of a tweet is a classification task, with 10 possible classes, aka categories, as we chose 10 languages.
  • The first 100 tweets used for learning are called the training set. The correct languages attached are called labels.
  • The other 50 tweets, used only to examine the man/model, are called the test set. Note that we know its labels, but the man/model doesn’t.

The graph below shows how to correctly train and test the model:

Image 1: scheme for training and testing the model properly. Image by author.

So the main rule is:

Test a machine learning model on a different piece of data than you trained it on.

If the model does well on the training set but performs poorly on the test set, we say that the model is overfitted. “Overfitting” means memorizing the training data instead of learning the general patterns behind it. That’s definitely not what we want to achieve. Our goal is to have a trained model — one that is good for both the training and the test set. Only this kind of model can be trusted. And only then may we believe that it will perform as well in the final application it’s being built for as it did on the test set.
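
To make the rule concrete, here is a minimal sketch in Python, assuming scikit-learn and two hypothetical lists, tweets and languages, holding the texts and their labels (the names and model choice are ours, purely for illustration):

```python
# A minimal sketch of the train/test rule, assuming scikit-learn.
# `tweets` and `languages` are hypothetical lists of texts and their labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hold out 50 tweets the model will never see during training.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, languages, test_size=50, random_state=42
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# A large gap between these two scores is the signature of overfitting.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```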

Now let’s take it a step further.

1000 men on 1000 desert islands

Imagine you really really want to teach a man to recognize the languages of tweets on Twitter. So you find 1000 candidates, take each to a different desert island, give each the same 100 tweets in 10 languages, tell each what language each tweet is in and leave them all alone for a couple of days. After that, you examine each candidate with the same set of 50 different tweets.

Which candidate will you choose? Of course, the one who did the best on the 50 tweets. But how good is he really? Can we truly believe that he’s going to perform as well in the final application as he did on these 50 tweets?

The answer is no! Why not? To put it simply, if every candidate knows some answers and guesses some of the others, then you choose the one who got the most answers right, not the one who knew the most. He is indeed the best candidate, but his result is inflated by “lucky guesses,” and that luck was likely a big part of the reason why he was chosen.

To show this phenomenon in numerical form, imagine that 47 tweets were easy for all the candidates, but the 3 remaining messages were so hard for all the competitors that they all simply guessed the languages blindly. Probability says that the chance that somebody (possibly more than one person) got all 3 hard tweets right is above 63% (info for math nerds: it’s almost 1–1/e). So you’ll probably choose someone who scored perfectly, but in fact he’s not perfect for what you need.
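
If you’d like to check that number yourself, the whole calculation fits in a few lines of Python:

```python
# Each candidate guesses the 3 hard tweets blindly among 10 languages,
# so a single candidate nails all 3 with probability (1/10) ** 3 = 0.001.
p_single = (1 / 10) ** 3

# The chance that at least one of the 1000 candidates gets all 3 right:
p_at_least_one = 1 - (1 - p_single) ** 1000
print(p_at_least_one)  # about 0.632, i.e. roughly 1 - 1/e
```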

Perhaps 3 out of 50 tweets in our example doesn’t sound like much, but in many real-life cases this discrepancy tends to be much more pronounced.

So how can we check how good the winner actually is? Yes, we have to procure yet another set of 50 tweets and examine him once again! Only this way will we get a score we can trust, and only that score tells us the level of accuracy to expect from the final application.

Let’s go back to machine learning terminology

In terms of names:

  • The first set of 100 tweets is still called the training set, as we use it to train the models.
  • But now the purpose of the second set of 50 tweets has changed. This time it is used to compare different models. Such a set is called the validation set.
  • We already understand that the result of the best model examined on the validation set is artificially boosted. This is why we need one more set of 50 tweets to play the role of the test set and give us reliable information about the quality of the best model.

You can find the flow of using the training, validation and test set in the image below:

Image 2: scheme for training, validating and testing the models properly. Image by author.
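
In Python this flow might look roughly like the sketch below, assuming scikit-learn and three hypothetical, already-split sets X_train/y_train, X_val/y_val and X_test/y_test (the two candidate models are arbitrary examples):

```python
# A sketch of the train/validate/test flow, assuming scikit-learn and
# hypothetical, already-split sets: X_train/y_train, X_val/y_val, X_test/y_test.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

candidates = {
    "logistic_regression": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

# Train every candidate on the training set and compare them on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)

best_name = max(val_scores, key=val_scores.get)

# The winner's validation score is optimistically biased ("lucky guesses"),
# so we report its quality on the untouched test set instead.
print("best model:", best_name)
print("test accuracy:", candidates[best_name].score(X_test, y_test))
```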

All right, and why did we use sets of exactly 100, 50 and 50 tweets?

Here are the two general ideas behind these numbers:

Put as much data as possible into the training set.

The more training data we have, the broader the picture the models get and the greater the chance of genuine learning instead of overfitting. The only limits should be data availability and the costs of processing the data.

Put as small an amount of data as possible into the validation and test sets, but make sure they’re big enough.

Why? Because you don’t want to waste much data on anything but training. But on the other hand, you probably feel that evaluating the model based on a single tweet would be risky. So you need sets of tweets big enough that a handful of really weird tweets can’t distort the score.

And how do we convert these two guidelines into exact numbers? If you have 200 tweets available, then the 100/50/50 split seems fine, as it obeys both the rules above. But if you’ve got 1,000,000 tweets, then you can easily go for 800,000/100,000/100,000 or even 900,000/50,000/50,000. Maybe you’ve seen percentage rules of thumb somewhere, like 60%/20%/20% or so. Well, they are only an oversimplification of the two main rules written above, so it’s better to simply stick to the original guidelines.

OK, but how to choose which tweets will go into the training/validation/test set?

We believe this main rule appears clear to you at this point:

Use three different pieces of data for training, validating, and testing the models.

So what if this rule is broken? What if the same or almost the same data, whether by accident or a failure to pay attention, go into more than one of the three datasets? This is what we call data leakage. The validation and test sets are no longer trustworthy. We can’t tell whether the model is trained or overfitted. We simply can’t trust the model. Not good.

Perhaps you think these problems don’t concern our desert island story. We just take 100 tweets for training, another 50 for validating and yet another 50 for testing and that’s it. Unfortunately, it’s not so simple. We have to be very careful. Let’s go through some examples.

Example 1: many random tweets

Assume that you scraped 1,000,000 completely random tweets from Twitter. Different authors, times, topics, locations, numbers of reactions, etc. Just random. They are in 10 languages, and you want to use them to teach the model to recognize the language. Then you don’t have to worry about anything: you can simply draw 900,000 tweets for the training set, 50,000 for the validation set and 50,000 for the test set. This is called the random split.

Why draw at random, and not put the first 900,000 tweets in the training set, the next 50,000 in the validation set and the last 50,000 in the test set? Because the tweets can initially be sorted in a way that wouldn’t help, such as alphabetically or by the number of characters. And we have no interest in only putting tweets starting with ‘Z’ or the longest ones in the test set, right? So it’s just safer to draw them randomly.

Image 3: random data split. Image by author.
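
One convenient way to perform such a random split is to call scikit-learn’s train_test_split twice. Here is a sketch, assuming hypothetical lists tweets and languages of 1,000,000 items each:

```python
# A sketch of the 900,000 / 50,000 / 50,000 random split, assuming scikit-learn.
# `tweets` and `languages` are hypothetical lists of 1,000,000 items each.
from sklearn.model_selection import train_test_split

# First carve out the test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    tweets, languages, test_size=50_000, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=50_000, random_state=42
)

# train_test_split shuffles by default, so any initial ordering
# (alphabetical, by length, etc.) can't sneak into the split.
```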

The assumption that the tweets are completely random is strong. Always think twice if that’s true. In the next examples you’ll see what happens if it’s not.

Example 2: not so many random tweets

If we only have 200 completely random tweets in 10 languages then we can still split them randomly. But then a new risk arises. Suppose that a language is predominant with 128 tweets and there are 8 tweets for each of the other 9 languages. Probability says that then the chance that not all the languages will go to the 50-element test set is above 61% (info for math nerds: use the inclusion-exclusion principle). But we definitely want to test the model on all 10 languages, so we definitely need all of them in the test set. What should we do?

We can draw tweets class-by-class. So take the predominant class of 128 tweets, draw the 64 tweets for the training set, 32 for the validation set and 32 for the test set. Then do the same for all the other classes — draw 4, 2 and 2 tweets for training, validating and testing for each class respectively. This way, you’ll form three sets of the sizes you need, each with all classes in the same proportions. This strategy is called the stratified random split.
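
In scikit-learn you don’t have to draw class-by-class manually: the stratify argument of train_test_split does it for you. A sketch for the 200-tweet case, with tweets and languages again being hypothetical lists:

```python
# A sketch of the stratified random split (100/50/50), assuming scikit-learn.
# `tweets` and `languages` are the hypothetical 200 examples from this section.
from sklearn.model_selection import train_test_split

X_rest, X_test, y_rest, y_test = train_test_split(
    tweets, languages, test_size=50, stratify=languages, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=50, stratify=y_rest, random_state=42
)

# Each language now appears in all three sets in (roughly) the same
# proportion as in the full dataset.
```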

The stratified random split seems better/safer than the ordinary random split, so why didn’t we use it in Example 1? Because we didn’t have to! What often defies intuition is that if 5% of 1,000,000 tweets are in English and we draw 50,000 tweets with no regard for language, then very close to 5% of the tweets drawn will also be in English. This is how probability works. But probability needs big enough numbers to work properly, so if you have 1,000,000 tweets you don’t have to care, but if you only have 200, watch out.

Example 3: tweets from several institutions

Now assume that we’ve got 100,000 tweets, but they are from only 20 institutions (let’s say a news TV station, a big soccer club, etc.), and each of them runs 10 Twitter accounts in 10 languages. And again our goal is to recognize the Twitter language in general. Can we simply use the random split?

You’re right — if we could, we wouldn’t have asked. But why not? To understand this, first let’s consider an even simpler case: what if we trained, validated and tested a model on tweets from one institution only? Could we use this model on any other institution’s tweets? We don’t know! Maybe the model would overfit the unique tweeting style of this institution. We wouldn’t have any tools to check it!

Let’s return to our case. The point is the same. The total number of 20 institutions is on the small side. So if we use data from the same 20 institutions to train, compare and score the models, then maybe the model overfits the 20 unique styles of these 20 institutions and will fail on any other author. And again there is no way to check it. Not good.

So what to do? Let’s follow one more main rule:

Validation and test sets should simulate the real case which the model will be applied to as faithfully as possible.

Now the situation is clearer. Since we expect different authors in the final application than we have in our data, we should also have different authors in the validation and test sets than we have in the training set! And the way to do so is to split data by institutions! If we draw, for example, 10 institutions for the training set, another 5 for the validation set and put the last 5 in the test set, the problem is solved.

Image 4: data split by institution. Image by author.

Note that any less strict split by institution (for example, putting 4 whole institutions plus a small share of tweets from the remaining 16 in the test set) would be a data leak, which is bad, so we have to be uncompromising when it comes to separating the institutions.
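
One way to enforce this strict separation is to split on institution identifiers, for example with scikit-learn’s GroupShuffleSplit. Below is a sketch, assuming hypothetical NumPy arrays tweets, languages and institutions, where institutions[i] names the account that posted tweet i:

```python
# A sketch of a split by institution, assuming scikit-learn.
# `tweets`, `languages`, `institutions` are hypothetical NumPy arrays of equal
# length; institutions[i] names the owner of tweet i (20 unique institutions).
from sklearn.model_selection import GroupShuffleSplit

# Step 1: keep 10 of the 20 institutions (half of the groups) for training.
outer = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, rest_idx = next(outer.split(tweets, languages, groups=institutions))

# Step 2: split the remaining 10 institutions into 5 for validation and 5 for testing.
inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
val_rel, test_rel = next(inner.split(tweets[rest_idx], languages[rest_idx],
                                     groups=institutions[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]

# No institution ends up in more than one of the three sets,
# so the model can't "cheat" by memorizing an author's style.
```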

A sad final note: with a correct split by institution, we may trust our solution for tweets from other institutions. But tweets from private accounts may — and do — look different, so we can’t be sure our model will perform well for them. With the data we have, we have no tool to check it…

Example 4: same tweets, different goal

Example 3 is hard, but if you went through it carefully then this one will be fairly easy. So, assume that we have exactly the same data as in Example 3, but now the goal is different. This time we want to recognize the language of other tweets from the same 20 institutions that we have in our data. Will the random split be OK now?

The answer is: yes. The random split perfectly follows the last main rule above as we are ultimately only interested in the institutions we have in our data.

Examples 3 and 4 show us that the way we should split the data does not depend only on the data we have. It depends on both the data and the task. Please bear that in mind whenever you design the training/validation/test split.

Example 5: still the same tweets, yet another goal

In the last example let’s keep the data we have, but now let’s try to teach a model to predict the institution from future tweets. So we once again have a classification task, but this time with 20 classes as we’ve got tweets from 20 institutions. What about this case? Can we split our data randomly?

As before, let’s think about a simpler case for a while. Suppose we only have two institutions — a TV news station and a big soccer club. What do they tweet about? Both like to jump from one hot topic to another. Three days about Trump or Messi, then three days about Biden and Ronaldo, and so on. Clearly, in their tweets we can find keywords that change every couple of days. And what keywords will we see in a month? Which politician or villain or soccer player or soccer coach will be ‘hot’ then? Possibly one that is completely unknown right now. So if you want to learn to recognize the institution, you shouldn’t focus on temporary keywords, but rather try to catch the general style.

OK, let’s move back to our 20 institutions. The above observation remains valid: the topics of tweets change over time, so as we want our solution to work for future tweets, we shouldn’t focus on short-lived keywords. But a machine learning model is lazy. If it finds an easy way to fulfill the task, it doesn’t look any further. And sticking to keywords is just such an easy way. So how can we check whether the model learned properly or just memorized the temporary keywords?

We’re pretty sure you realize that if you use the random split, you should expect tweets about every hero-of-the-week in all the three sets. So this way, you end up with the same keywords in the training, validation and test sets. This is not what we’d like to have. We need to split smarter. But how?

When we go back to the last main rule, it becomes easy. We want to use our solution in the future, so the validation and test sets should be the future with respect to the training set! We should split the data by time. So if we have, say, 12 months of data — from July 2022 up to June 2023 — then putting July 2022–April 2023 in the training set, May 2023 in the validation set and June 2023 in the test set should do the job.

Image 5: data split by time. Image by author.
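
With pandas, such a split by time is just a matter of filtering by date. A sketch, assuming a hypothetical DataFrame df with a datetime column created_at:

```python
# A sketch of the split by time, assuming pandas and a hypothetical DataFrame
# `df` with (at least) the columns "text", "institution" and "created_at".
import pandas as pd

df["created_at"] = pd.to_datetime(df["created_at"])

train_df = df[df["created_at"] < "2023-05-01"]   # July 2022 - April 2023
val_df = df[(df["created_at"] >= "2023-05-01") & (df["created_at"] < "2023-06-01")]  # May 2023
test_df = df[df["created_at"] >= "2023-06-01"]   # June 2023

# The validation and test sets are strictly "the future" with respect to
# the training set, which mimics how the model will actually be used.
```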

Maybe you are concerned that with the split by time we don’t check the model’s quality across all the seasons of the year. You’re right, that’s a problem. But still a smaller problem than we’d get if we split randomly. You can also consider, for example, the following split: the 1st–20th of every month go to the training set, the 21st–25th to the validation set, and the 26th to the end of every month to the test set. In any case, choosing a validation strategy is always a trade-off between different risks of data leakage. As long as you understand it and consciously choose the safest option, you’re doing well.

Summary

We set our story on a desert island and tried our best to avoid any and all complexities — to isolate the issue of model validation and testing from all possible real-world considerations. Even then, we stumbled upon pitfall after pitfall. Fortunately, the rules for avoiding them are easy to learn. As you’ll likely learn along the way, they are also hard to master. You will not always notice the data leak immediately. Nor will you always be able to prevent it. Still, careful consideration of the believability of your validation scheme is bound to pay off in better models. This is something that remains relevant even as new models are invented and new frameworks are released.

Also, we’ve got 1000 men stranded on desert islands. A good model might be just what we need to rescue them in a timely manner.
