Cross Validation Explained: Evaluating estimator performance.

Improve your ML model using cross validation.

Rahil Shaikh
Towards Data Science


The ultimate goal of a machine learning engineer or data scientist is to develop a model that makes accurate predictions or forecasts on new, unseen data. A good model is not one that merely gives accurate predictions on the known, training data, but one that gives good predictions on new data, avoiding both overfitting and underfitting.

After completing this tutorial, you will know:

  • Why cross validation is used: it is a procedure for estimating the skill of a model on new data.
  • There are common tactics that you can use to select the value of k for your dataset.
  • There are commonly used variations on cross-validation, such as stratified and LOOCV, that are available in scikit-learn.
  • How to practically implement k-fold cross validation in Python.

To derive a solution, we should first understand the problem. Before we proceed to understanding cross validation, let us first understand overfitting and underfitting.

Understanding Underfitting and Overfitting:

Overfit Model: Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well.

Overfitting a model results in good accuracy on the training data set but poor results on new data sets. Such a model is of no use in the real world, as it cannot predict outcomes for new cases.

Underfit Model: Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Underfitting is often a result of an excessively simple model or inadequate data preparation: missing data not handled properly, no outlier treatment, or failing to remove irrelevant features which do not contribute much to the target variable.

How to Tackle the Problem of Overfitting:

The answer is Cross Validation.

A key challenge with overfitting, and with machine learning in general, is that we can’t know how well our model will perform on new data until we actually test it.

To address this, we can split our initial dataset into separate training and test subsets.
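
As a minimal sketch of such a split (with placeholder random data, since no dataset has been introduced yet), scikit-learn's train_test_split does this in one call:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 4 features and binary labels
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the data as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)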

There are different types of Cross Validation techniques, but the overall concept remains the same:

1. Partition the data into a number of subsets

2. Hold out one subset at a time and train the model on the remaining subsets

3. Test the model on the held-out subset

4. Repeat the process for each subset of the dataset

[Image: the process of cross validation in general]

Types of Cross Validation:

• K-Fold Cross Validation

• Stratified K-Fold Cross Validation

• Leave One Out Cross Validation

Let’s understand each type one by one

k-Fold Cross Validation:

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into; as such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the name of the procedure, such as k=10 becoming 10-fold cross-validation.

If k=5, the dataset will be divided into 5 equal parts, and the process below will run 5 times, each time with a different holdout set.

1. Take one group as the holdout or test data set

2. Take the remaining groups as the training data set

3. Fit a model on the training set and evaluate it on the test set

4. Retain the evaluation score and discard the model

At the end of the above process, summarize the skill of the model using the sample of model evaluation scores, for example by taking their mean and standard deviation.
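
To make these four steps concrete, here is a minimal sketch of the loop written out by hand with scikit-learn's KFold; the built-in Iris dataset is used here only as a stand-in:

import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # stand-in dataset

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_index, test_index in kf.split(X):
    # Steps 1-2: split into holdout and training groups
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Step 3: fit on the training set, evaluate on the holdout set
    model = svm.SVC()
    model.fit(X_train, y_train)
    # Step 4: retain the score and discard the model
    scores.append(model.score(X_test, y_test))

# Summarize the skill of the model across the folds
print("mean=%.3f std=%.3f" % (np.mean(scores), np.std(scores)))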

How to decide the value of k?

The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.

A value of k=10 is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.

If the chosen value of k does not evenly split the data sample, then some folds will contain one more example than the others. It is preferable to split the data sample into k groups with the same number of samples, so that the model skill scores from each fold are comparable.
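
As a small illustration of what happens when the data does not divide evenly, this sketch splits 10 samples into 4 folds; note that scikit-learn spreads the leftover examples across the first folds rather than putting them all into one group:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples do not split evenly into 4 folds
for train_idx, test_idx in KFold(n_splits=4).split(X):
    print(len(test_idx), test_idx)
# Fold sizes come out as 3, 3, 2, 2: scikit-learn spreads the remainder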

Stratified k-Fold Cross Validation:

This is the same as k-Fold Cross Validation, with just one slight difference.

The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.

In the image below, the stratified k-fold split is made on the basis of gender, so that each fold preserves the proportion of M and F observations.

[Image: stratified k-fold cross validation]
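
A minimal sketch of the stratified variant, again using the Iris dataset as a stand-in; passing a StratifiedKFold object as cv makes cross_val_score preserve the class proportions of y in every fold:

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in dataset with 3 balanced classes

# Each fold keeps roughly the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(svm.SVC(), X, y, scoring='accuracy', cv=skf)
print(scores.mean())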

Leave One Out Cross Validation (LOOCV):

This approach leaves 1 data point out of the training data, i.e. if there are n data points in the original sample, then n-1 samples are used to train the model and the single remaining point is used as the validation set. This is repeated for every way the original sample can be separated like this, and the error is then averaged over all trials to give the overall effectiveness.

The number of possible combinations is equal to the number of data points in the original sample, n.

[Image: representation of leave one out cross validation]
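
A minimal sketch of LOOCV with scikit-learn's LeaveOneOut, once more using Iris as a stand-in; be aware that one model is trained per data point, so this gets expensive on large datasets:

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in dataset with n=150 samples

# One model is fit per data point, so there are n trials in total
scores = cross_val_score(svm.SVC(), X, y, scoring='accuracy', cv=LeaveOneOut())
print(len(scores))     # n, the number of data points
print(scores.mean())   # accuracy averaged over all trials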

Cross Validation is a very useful technique for assessing the effectiveness of your model, particularly in cases where you need to mitigate over-fitting.

Implementation of Cross Validation In Python:

We do not need to call the fit method separately when using cross validation; the cross_val_score method fits the model on each training split itself while performing cross-validation on the data. Below is an example of using k-fold cross validation.

import pandas as pd
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Read the csv file
data = pd.read_csv("D://RAhil//Kaggle//Data//Iris.csv")

# Create the dependent and independent datasets based on our features
X = data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm']]
y = data['Species']

# Run 10-fold cross validation with an SVM classifier
model = svm.SVC()
accuracy = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print(accuracy)

# Get the mean accuracy across the folds
print("Accuracy of Model with Cross Validation is:", accuracy.mean() * 100)

Output:

The accuracy of the model is the average of the accuracies across the folds.

In this tutorial, you discovered why we need Cross Validation, got a gentle introduction to the different types of cross validation techniques, and saw a practical example of the k-fold cross validation procedure for estimating the skill of machine learning models.

Specifically, you learned:

  • That cross validation is a procedure used to guard against overfitting and to estimate the skill of the model on new data.
  • There are common tactics that you can use to select the value of k for your dataset.
  • There are commonly used variations on cross-validation, such as stratified and LOOCV, that are available in scikit-learn.

If you liked this blog, give it some CLAPS and SHARE it with your friends. You can find more interesting articles here; stay tuned for more interesting techniques and concepts of Machine Learning.
