
Evaluating Train-Test Split Strategies in Machine Learning: Beyond the Basics

Creating Appropriate Test Sets and Sleeping Soundly.


With this article, I want to examine a question often overlooked both by those who ask it and by those who answer it: "How do you partition a dataset into training and test sets?"

When approaching a supervised problem, it is common practice to split the dataset into (at least) two parts: the training set and the test set. The training set is used for studying the phenomenon, while the test set is used to verify whether the learned information can be replicated on "unknown" data, i.e., data not present in the previous phase.

Many people typically follow standard, obvious approaches to make this decision. The common, unexciting answer is: "I randomly partition the available data, reserving 20% to 30% for the test set."

Those who go further add the concept of stratified random sampling: that is, sampling randomly while keeping the proportions of one or more variables fixed. Imagine we are in a binary classification context and have a target variable with a prior probability of 5%. Random sampling stratified on the target variable means obtaining a training set and a test set that both preserve the 5% prior of the target variable.
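For reference, here is a minimal sketch of both splits with scikit-learn; the synthetic data, the 25% test size, and the random seeds are illustrative assumptions, not part of the original setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10,000 consumers, 10 features, ~5% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))
y = (rng.random(10_000) < 0.05).astype(int)

# Plain random split: hold out 25% of the rows for the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Stratified random split: the ~5% prior of the target variable is
# preserved in both the training and the test partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both close to 0.05
```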

Reasoning of this kind is sometimes necessary, for example, for classification in a very imbalanced context, but it doesn’t add much excitement to the matter.

The question is more difficult than it seems, and the correct answer borders on the banal: "It depends."

For simplicity, let’s continue the discussion in a binary classification context.

Have you ever had to train an algorithm to predict a phenomenon that is mutable and constantly evolving?

Complexity of Partitioning in Real-World Contexts

Imagine having to create an algorithm capable of predicting the propensity to purchase a product. In this case, it is tough to distinguish the real from the personal, or the subjective from the objective. Imagine having a dataset that provides information about consumers. It would not be surprising to observe puzzling phenomena, such as very similar consumers behaving differently by buying or not buying your product. This is because the available data likely cannot capture all the underlying motivations for the decision we want to model. In other words, some factors might be exogenous or even pseudo-random.

Image created in Midjourney

The phenomenon itself depends on the market and is therefore mutable. What happens if a competitor drastically lowers prices? What happens if a new product starts stealing market share? What happens if the product goes out of fashion and its consumption slowly begins to decline?

I have good news for you: despite all this bad news, you can still do something good to support your business and create a performant model, but you need the right precautions.

An impatient reader might think: but what does all this have to do with the division between training and test sets? It has a lot to do with it, but before connecting all the dots, let me philosophize for a second about the concept of a test set.

The test set should be as close as possible to the real data that the model will face once in production. It should therefore serve as the final bulwark, consulted only once every last doubt has been resolved. Good performance on the test set should allow us to sleep soundly once the model is in use, and thus the test set should be constructed with the goal of production deployment in mind.

One possible strategy could be to split the training and test sets based on time, for example, leaving the last two months for the test set. This reasoning rests on the idea that the most recent months are inevitably the best representation of the real data that will arrive in production. It might be reasonable to think that the most recent data available is the closest to the unknown data we will face in the future. Yet this strategy is not without pitfalls: addressing the problem in this way introduces a temporal distortion into the model evaluation.
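As a hedged sketch of this time-based holdout (the toy DataFrame, the column names, and the two-month cut-off are assumptions for illustration):

```python
import pandas as pd

# Toy purchase data: two years of daily records with a date column.
df = pd.DataFrame({"date": pd.date_range("2022-01-01", "2023-12-31", freq="D")})
df["bought"] = (df.index % 20 == 0).astype(int)  # dummy target

# Temporal split: the last two months become the test set.
cutoff = df["date"].max() - pd.DateOffset(months=2)
train = df[df["date"] <= cutoff]
test = df[df["date"] > cutoff]
```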

Let’s think about an extreme case, such as a strongly seasonal phenomenon. Imagine the historical series of ice cream sales. Relying entirely on the last available months would mean making inefficient decisions. The risk is favoring a model that adapts to the most recent period rather than one that generalizes better. Imagine a temporal split that leaves the summer months in the test set: the risk is favoring models that predict above the average, and the opposite would happen with winter months in the test set.

Image created in Midjourney

This is because, for a seasonal phenomenon, the initial hypothesis no longer holds. In this case, it is not true that the last moments of our dataset are the closest to the production data. The data closest to the production data is probably the data from the same period of the previous year, or better, of the previous season. In this scenario, I prefer to sample by stratifying the phenomenon temporally, to evaluate performance across different time components.

Advanced Strategies and Final Considerations

I will attempt to clarify with an example. We randomly select 25% of the dataset while maintaining the proportions of months and years unchanged. This allows us to assess the stability of the phenomenon by analyzing the test performance across different time components.
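A minimal sketch of this temporally stratified sampling, again on toy data (the column names and the 25% fraction are illustrative assumptions):

```python
import pandas as pd

# Toy data spanning two years, with a (year, month) key for stratification.
df = pd.DataFrame({"date": pd.date_range("2022-01-01", "2023-12-31", freq="D")})
df["bought"] = (df.index % 20 == 0).astype(int)  # dummy target
df["year_month"] = df["date"].dt.to_period("M")

# Draw 25% of the rows within every (year, month) group, so the test set
# preserves the proportions of months and years of the full dataset.
test = df.groupby("year_month", group_keys=False).sample(frac=0.25, random_state=42)
train = df.drop(test.index)

# Performance can then be inspected month by month to check the stability
# of the phenomenon across different time components.
```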

In certain specific scenarios, I have chosen to do both: introduce a random component drawn from the entire dataset into the test set, as well as a temporal component limited to the last period.

I wanted to understand how the "old" data could predict both new data from the same time period (the "old" data present in the test set) and new data from the immediately following period (the last temporal slice). If there is a significant difference in performance between the two, it could indicate a decrease in the reliability of the predictive data or a potential shift in the patterns of the phenomenon being studied.
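Under the same toy-data assumptions as above, this double test set could be sketched as follows (the 25% random slice and the two-month temporal slice are illustrative choices):

```python
import pandas as pd

# Toy data: two years of daily records with an assumed "date" column.
df = pd.DataFrame({"date": pd.date_range("2022-01-01", "2023-12-31", freq="D")})
df["bought"] = (df.index % 20 == 0).astype(int)  # dummy target

# 1) Temporal test set: the last period (here, the last two months).
cutoff = df["date"].max() - pd.DateOffset(months=2)
temporal_test = df[df["date"] > cutoff]
remaining = df[df["date"] <= cutoff]

# 2) Random test set: a random slice of the remaining "old" data.
random_test = remaining.sample(frac=0.25, random_state=42)
train = remaining.drop(random_test.index)

# The model is trained on `train` and scored on both test sets; a large
# gap between the two scores may signal drift in the phenomenon.
```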

It is quite likely that in your case a good strategy could be a different one, perhaps one still to be discovered!

What I am trying to suggest is that unfortunately, there is no golden rule to follow, but the only viable path is to try to build the test dataset carefully based on the problem being analyzed.

In attempting to untangle this complicated web, I try to do so by keeping these questions in mind:

  • How can I create a test dataset that closely resembles the production data?
  • Does the strategy I have chosen allow me to evaluate the model’s performance as thoroughly as possible?
  • Will my dataset allow me to sleep peacefully when I release the solution?

Remember that these questions are applicable even in cases involving non-mutable phenomena.

Let’s take one of the most famous playgrounds in the field as an example: imagine facing the problem of classifying images of dogs and cats. In this case, there is nothing mutable. Remember this: a dog will always be a dog, and a cat will always be a cat. However, if we only include poodles in the test set, we will end up choosing the best model for predicting poodles without knowing how it performs on dachshunds.

Unfortunately, the only rule is that there are no rules.

