Data splitting technique to fit any Machine Learning Model

The purpose of splitting data into different categories is to avoid overfitting

Sachin Kumar
Towards Data Science



This aims to be a short 4-minute article introducing you to the data splitting technique and its importance in practical projects.

Conventionally, it is suggested to divide your dataset into three parts, to avoid overfitting and model-selection bias:

  1. Training set (Has to be the largest set)
  2. Cross-Validation set or Development set or Dev set
  3. Testing Set

The test set can sometimes be omitted. It is meant to give an unbiased estimate of the algorithm's performance in the real world. People who divide their dataset into just two parts usually call their dev set the test set.

We build a model on the training set, then optimize its hyperparameters on the dev set as much as possible; once the model is ready, we evaluate it on the test set.
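As a sketch of this three-stage workflow with scikit-learn (the synthetic dataset and the logistic-regression model below are illustrative assumptions, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset; any labelled data works the same way.
X, y = make_classification(n_samples=1000, random_state=0)

# 60% train, 20% dev, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit on the training set, pick the hyperparameter on the dev set.
best_model, best_score = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_dev, y_dev)
    if score > best_score:
        best_model, best_score = model, score

# Only the final, chosen model touches the test set.
test_accuracy = best_model.score(X_test, y_test)
```

The key point is that the test set is scored exactly once, after all model choices have been made on the dev set.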

# Training Set:

The sample of data used to fit the model, i.e. the actual subset of the dataset that we use to train the model (estimating the weights and biases in the case of a neural network). The model observes and learns from this data and optimizes its parameters.

# Cross-Validation Set:

We select the appropriate model, or the degree of the polynomial (in the case of polynomial regression), by minimizing the error on the cross-validation set.
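For instance, a minimal sketch of picking a polynomial degree by dev-set error; the cubic synthetic data here is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x.ravel() ** 3 - 2 * x.ravel() + rng.normal(scale=1.0, size=200)  # cubic + noise

# Simple 75/25 train/dev split of the synthetic data.
x_train, y_train = x[:150], y[:150]
x_cv, y_cv = x[150:], y[150:]

# Fit one model per candidate degree, record its error on the dev set.
errors = {}
for degree in range(1, 7):
    features = PolynomialFeatures(degree)
    model = LinearRegression().fit(features.fit_transform(x_train), y_train)
    pred = model.predict(features.transform(x_cv))
    errors[degree] = mean_squared_error(y_cv, pred)

best_degree = min(errors, key=errors.get)  # degree with the lowest dev-set error
```

Each candidate degree is trained only on the training set; the dev set alone decides which degree wins.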

# Test set:

The sample of data used to provide an unbiased evaluation of the final model fit on the training dataset. It is only used once the model is completely trained using the training and validation sets. Therefore, the test set is the one used to replicate the kind of situation the model will encounter once it is deployed for real-time use.

The test set is generally what is used to evaluate competing models in competitions on Kaggle or Analytics Vidhya. Typically in a machine learning hackathon, the cross-validation set is released along with the training set, while the actual test set is only released when the competition is about to close, and it is the score of the model on the test set that decides the winner.

# How to decide the ratio of splitting the dataset?

Fig: Splitting of Dataset, Source — Made on infogram.com

The answer generally lies in the dataset itself. The proportions are decided according to the size and type of the data available to us (for time-series data, splitting techniques are a bit different).

If the size of our dataset is between 100 and 1,000,000 examples, we split it in the ratio 60:20:20. That is, 60% of the data goes to the training set, 20% to the dev set, and the remaining 20% to the test set.

If the size of the dataset is greater than 1 million, then we can split it as 98:1:1 or even 99:0.5:0.5.
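The same two-step splitting trick (shown later in this article for 60:20:20) also yields a 98:1:1 split; the toy array below is just a placeholder for a large dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder "large" dataset of 100,000 examples.
X = np.arange(100_000).reshape(-1, 1)
y = np.zeros(100_000)

# Hold out 2% first, then split that 2% half-and-half into dev and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.02, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_dev), len(X_test))  # 98000 1000 1000
```

At this scale, 1% is still 1,000 examples, which is usually plenty to compare models on.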

The main aim when deciding the splitting ratio is that all three sets should reflect the general trend of the original dataset. If our dev set has very little data, we may end up selecting a model biased towards trends present only in the dev set. The same goes for the training set: too little data will bias the model towards trends found only in that subset of the dataset.

The models that we deploy are nothing but estimators learning the statistical trends in the data. Therefore, it is important that the data used to learn and the data used to validate or test the model follow as similar a statistical distribution as possible. One way to achieve this is to select the subsets (here the training set, the dev set, and/or the test set) randomly. For example, suppose you are working on a face-detection project where the training pictures are taken from the web while the dev/test pictures come from users' cell phones; there will then be a mismatch between the properties of the training set and the dev/test sets.
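For classification tasks, one practical way to keep the label distribution similar across subsets is scikit-learn's `stratify` option; the imbalanced toy labels below are an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset: 90% of examples are class 0, 10% are class 1.
y = np.array([0] * 900 + [1] * 100)
X = np.arange(1000).reshape(-1, 1)

# stratify=y preserves the 90/10 class ratio in both resulting subsets.
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_dev.mean())  # both 0.1
```

Without `stratify`, a purely random split of a small, imbalanced dataset can leave the dev set with a noticeably different class balance than the training set.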

One way to divide the dataset into train, dev, and test sets with 0.6, 0.2, 0.2 ratios is to use the train_test_split method twice:

```python
from sklearn.model_selection import train_test_split

# First split: 80% for training + dev, 20% for the test set.
x, x_test, y, y_test = train_test_split(data, labels, test_size=0.2, train_size=0.8)

# Second split: 25% of the remaining 80% (i.e. 20% of the original) for the dev set.
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)
```

That’s all for this article, folks. If you’ve made it this far, please comment below with your experience while reading, provide feedback, and feel free to add me on LinkedIn.
