
Why Do We Need a Validation Set in Addition to Training and Test Sets?

Training, validation and test sets explained in plain English

Photo by Shumilov Ludmila on Unsplash

You may already be familiar with training and test sets. When training ML and DL models, you often split the entire dataset into training and test sets.

This is because you need a separate test set to evaluate the model on unseen data and get an honest estimate of how well it generalizes.

We never test a model on the same data used for training. If we did, a model that simply memorized the training data would look perfect, yet it would fail to generalize to new, unseen data.

The validation set is also a part of the original dataset. Just like the test set, it is used to evaluate the model. However, this is not the final evaluation.

Machine learning is a highly iterative process – Andrew Ng

Machine learning is not a one-time process. You need to experiment with different models by setting different values for the hyperparameters before you find the best one.

This is where the validation set comes into play.

Training set vs validation set vs test set

Training, validation and testing are key steps in the ML workflow. Each step needs its own dataset. Therefore, the entire dataset is divided into the following parts.

  • Training set: This is the largest portion of the dataset. The training set is used to train (fit) the model. The model parameters learn their values (rules or patterns) from the training data. In other words, the training set is used to fit the parameters of the model for a fixed combination of hyperparameters.
  • Validation set: Our model training process is not a one-time process. We have to train multiple models by trying different combinations of hyperparameters. Then, we evaluate the performance of each model on the validation set. Therefore, the validation set is useful for hyperparameter tuning or for selecting the best model out of several candidates. In some contexts, the validation set is also called the development (dev) set.
  • Test set: After the tuning process, we select the best model with an optimal combination of hyperparameters. We measure the performance of that model using the test set.

It is straightforward to understand the roles of the training and test sets. You may not yet be familiar with the role of the validation set, which is the focus of today’s article.

Let’s look at an example.

Let’s say we want to train a random forest classifier on a dataset by trying out different values for n_estimators and max_depth hyperparameters.

from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=?, max_depth=?)

The default value for n_estimators is 100 and for max_depth is None. But we are not interested in these default values. Instead, we want to try out the following values.

  • n_estimators: 100, 500, 1000 (3 different values)
  • max_depth: 2, 3, 5, 10 (4 different values)

There are 12 (3 x 4) different combinations of hyperparameters. That means we’ll build 12 different random forest classifiers, one for each combination. For example,

rf_clf_1 = RandomForestClassifier(n_estimators=100, max_depth=2)

We train the first model on the training set, evaluate its performance on the validation set, record its performance score and keep it aside.

rf_clf_2 = RandomForestClassifier(n_estimators=100, max_depth=3)

We train the second model on the same training set, evaluate its performance on the same validation set, record its performance score and keep it aside.

Likewise, we train all 12 models and record their performance scores. Then, we select the model with the best score and note down its hyperparameter values. Let’s say those values are n_estimators=500 and max_depth=3.

rf_clf_best = RandomForestClassifier(n_estimators=500, max_depth=3)

Finally, we evaluate this model on the test set.
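
To make the whole procedure concrete, here is a minimal sketch of that manual tuning loop. It assumes the data has already been split into X_train/y_train, X_valid/y_valid and X_test/y_test (as described in the next section) and uses accuracy (the default score) for illustration.

from itertools import product
from sklearn.ensemble import RandomForestClassifier

# 3 x 4 = 12 hyperparameter combinations to try
n_estimators_values = [100, 500, 1000]
max_depth_values = [2, 3, 5, 10]

best_score, best_params = -1.0, None
for n_est, depth in product(n_estimators_values, max_depth_values):
    model = RandomForestClassifier(n_estimators=n_est, max_depth=depth)
    model.fit(X_train, y_train)              # fit parameters on the training set
    score = model.score(X_valid, y_valid)    # evaluate on the validation set
    if score > best_score:
        best_score = score
        best_params = {'n_estimators': n_est, 'max_depth': depth}

# Refit the best model and run the final evaluation on the test set
rf_clf_best = RandomForestClassifier(**best_params)
rf_clf_best.fit(X_train, y_train)
print(rf_clf_best.score(X_test, y_test))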

In summary, the training set is used to fit the model parameters, and the validation set is used to tune the hyperparameters. Finally, we use the test set to evaluate the best model.

One could ask why we need a separate validation set. Can’t we tune the hyperparameters using the training set instead? The answer is that tuning is itself a form of testing, and testing shouldn’t be done on the same data used for training.

Using separate validation and test sets helps you select and honestly evaluate a model that generalizes to new, unseen data.

Also, note that the validation set is not needed (redundant) if you’re not going to tune the model by trying different combinations of hyperparameters. In that case, you may continue the training process by using just training and test sets.

How to create training, validation and test sets

Now, you’re familiar with the usage of training, validation and test sets. In this section, we discuss how to create these sets in Python.

We’re going to discuss 3 different methods of creating training, validation and test sets.

1. Using the Scikit-learn train_test_split() function twice

You may already be familiar with the Scikit-learn train_test_split() function. We can call it twice to create training, validation and test sets. Here is how to do it.

First, we create the training set by allocating 70% of the samples in the original dataset. Therefore, the train_size is 0.70.

from sklearn.model_selection import train_test_split
X_train, X_rem, y_train, y_rem = train_test_split(X, y,
                                                  train_size=0.70)

The training set includes X_train and y_train parts. The X_rem and y_rem parts belong to the remaining dataset that is used to create validation and test sets in the next step.

X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem,
                                                    test_size=0.50)

The validation set includes the X_valid and y_valid parts, and the X_test and y_test parts belong to the test set. Here, we’ve used test_size=0.50 because this split is relative to the remaining data: half of the remaining 30% gives 15% of the original dataset for the validation set and 15% for the test set.
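
As a quick sanity check (a minimal sketch, assuming X has a shape attribute, e.g. a NumPy array or a DataFrame), you can confirm the resulting 70/15/15 proportions:

n = X.shape[0]
print(X_train.shape[0] / n)   # ~0.70
print(X_valid.shape[0] / n)   # ~0.15
print(X_test.shape[0] / n)    # ~0.15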

2. Using the Fast-ML train_valid_test_split() function

In the above method, you have to call the train_test_split() function twice to create training, validation and test sets. By using the train_valid_test_split() function in the Fast-ML library, you can create all the sets with a single function call!

# pip install fast_ml
from fast_ml.model_development import train_valid_test_split

X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(
    X, y, train_size=0.70, valid_size=0.15, test_size=0.15)

3. Using the Scikit-learn GridSearchCV() and RandomizedSearchCV() functions

Here, we do not need to create the validation set explicitly, because these functions create it behind the scenes as part of k-fold cross-validation.

First, we need to split the dataset into train and test sets by using the Scikit-learn train_test_split() function.

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.15)

The test set contains 15% of the samples in the original dataset and the training set contains 85% of the samples.

Then, we run the GridSearchCV() or RandomizedSearchCV() function on the training set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 500, 1000], 'max_depth': [2, 3, 5, 10]}
gs = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
gs.fit(X_train, y_train)

After we find the best model, we test it using the test set that we created earlier.
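
For example, here is a minimal sketch using GridSearchCV’s standard attributes: best_params_ holds the winning hyperparameter combination, and best_estimator_ is the model refit on the full training set.

print(gs.best_params_)                            # best hyperparameter combination
print(gs.best_estimator_.score(X_test, y_test))   # final evaluation on the test set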

By setting cv=5 in the GridSearchCV() function, the algorithm splits the training set into 5 folds, each holding 17% of the original dataset (85% / 5). In each iteration, one fold is kept aside as the validation set, as shown in the following diagram.

(Diagram of 5-fold cross-validation. Image by author)

In each iteration, a different fold serves as the validation set, as shown in the image. The final performance score is the average across the five folds.
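
If you want to see this averaging directly, here is a minimal sketch using scikit-learn’s cross_val_score (with the same random forest as before, assumed for illustration). It returns one validation score per fold, which you can then average:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(n_estimators=500, max_depth=3),
                         X_train, y_train, cv=5)   # one score per fold
print(scores.mean())                               # average validation score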

If you’re not familiar with k-fold cross-validation, Grid Search or Random Search, please read the following articles written by me.

k-fold cross-validation explained in plain English

Python Implementation of Grid Search and Random Search for Hyperparameter Optimization

How large do training, validation and test datasets need to be?

This is a good question, and it is difficult to give an exact answer because the set sizes depend on the following factors.

  • Amount of data you have
  • How well the model should perform
  • How small a difference in the performance score you need to detect

One important rule is that you should allocate as much of the data as possible for the training set. The more data you allocate for the training set, the better the model learns the rules from data.

Another rule is that you should always shuffle the dataset before splitting.
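
In scikit-learn, train_test_split() does this for you: shuffling is enabled by default (shuffle=True), and you can pass random_state to get a reproducible split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    shuffle=True, random_state=42)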

Finally, each set should be a good representative sample of the original dataset.

Scikit-learn’s default test set size is 25% of the original data.

For small datasets with only hundreds or thousands of rows, it is better to use 0.8 : 0.1 : 0.1 or 0.7 : 0.15 : 0.15 for training, validation and test sets.

For large datasets with millions or billions of rows, you do not need to allocate a large percentage of data to the validation and test sets. It is better to use 0.98 : 0.01 : 0.01 or 0.96 : 0.02 : 0.02 for training, validation and test sets.

The validation (dev) set should be large enough to detect differences between algorithms that you are trying out – Andrew Ng

The validation set is used for hyperparameter tuning. It should be large enough to capture even a small change in the performance score so that the best model stands out.

Summary

Now, you have a clear idea about training, validation and test sets. To summarize, note the following points.

  • The training set is used for model training (learning parameters).
  • The validation set is used for hyperparameter tuning.
  • The test set is used for the final evaluation of the best model.
  • The validation set is not needed (redundant) if you’re not going to perform hyperparameter tuning.
  • GridSearchCV() and RandomizedSearchCV() functions create the validation set behind the scenes. So, we do not need to explicitly create the validation set when using these functions.

This is the end of today’s article. If you have any questions regarding this article, please let me know in the comment section.

Thanks for reading!

See you in the next article! As always, happy learning to everyone!


Rukshan Pramoditha 2022–04–11

