Hands-on Tutorials

After training a machine learning model, every data scientist wants to know how well the trained model will perform on unseen data. A good model is one that performs well not only on the training data but also on the test data. To estimate model performance, we often use part of the data for training and hold out the rest for testing, with the hope that the model's performance on the test data is representative of its performance on the data in the universe.
The following is an example of a simple classification problem. In this example, the Iris dataset is loaded from the Sklearn module and a Logistic Regression model is fit to the data. The data contains 150 records – 60% are used for training and 40% for testing.
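The code listing itself is not embedded here, so below is a minimal sketch of the kind of script being described – loading Iris from Sklearn, splitting 60/40 without a fixed seed, and fitting a Logistic Regression model 10 times. (The line numbers referenced further down refer to the author's original listing, not to this sketch, and the exact accuracies will differ from run to run.)

# Sketch: Iris data, a 60/40 train/test split with no fixed seed,
# and a Logistic Regression model fitted 10 times.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

for i in range(10):
    # No random_state: every iteration shuffles and splits the data differently.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"Model {i} accuracy: {round(model.score(X_test, y_test), 3)}")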
Output:
Model 0 accuracy: 0.967
Model 1 accuracy: 0.967
Model 2 accuracy: 0.933
Model 3 accuracy: 0.967
Model 4 accuracy: 0.933
Model 5 accuracy: 0.933
Model 6 accuracy: 0.85
Model 7 accuracy: 0.95
Model 8 accuracy: 1.0
Model 9 accuracy: 0.95
Line 1–15:
In these lines, the data is loaded and the modules are imported.
Line 17:
This is a loop that fits a Logistic Regression model to the same data at each iteration. The important thing to note is that the model performs differently at each iteration (see the model scores above). Note that the difference is not attributed to any parameter change during the iterations, because the same parameters are used for all the models. The source of the difference is the splitting of the data: train_test_split splits the data into 2 sets, but at each iteration different data is used for training and testing because of shuffling without seeding. If you want a consistent result at each iteration, pass a seed value to the random_state parameter of sklearn.model_selection.train_test_split. You can rewrite line 25 as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)
This allows the model results to be reproducible, that is, consistent. This is because the shuffling during splitting is seeded, and therefore the data used for training and testing is the same across all iterations.
A classical validation procedure involves doing things exactly as in the example above – splitting the data into two sets: a train set and a test set – but a single run of model scoring might not be conclusive about how the model will perform in production. For this reason, we need a better way of validating. This is exactly where Cross-validation comes in.
Cross-validation
Cross-validation is a technique used to validate a machine learning model and estimate how the trained model will perform on unseen data. It is better than classical evaluation, as you will see in the following discussion.
Cross-validation procedure
- [Optional] Shuffle the data.
- Split the data into k equally-sized (or roughly equal) distinct groups/folds (for this reason, cross-validation is also called k-fold cross-validation).
- Run through k iterations of model training and validation. For i in 1, 2, ..., k:
✓ Of the k folds, a single fold (the i-th fold) is used as the hold-out/validation data and the other k-1 folds are used for training the model.
✓ Once the model is validated, retain the model score, E_i.
- Aggregate the scores E_i obtained in the previous step to obtain the average performance of the model over all k folds. In most cases, aggregation is done using the arithmetic mean, such that:

E = (1/k) * Σ_{i=1}^{k} E_i

For classification tasks, the model can be scored using metrics like accuracy, precision, recall, etc., and for regression problems metrics like mean squared error, mean absolute error, etc. can be used to score the model at each iteration.
- Analyze the average score, E, to determine the likelihood of the model performing well on unseen data (the universe). A concrete sketch of this procedure in code follows this list.
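Here is a minimal sketch of the steps above, assuming the same Iris/Logistic Regression setup as earlier and k = 5 with shuffling (the fold scores E_i and their mean E will vary with the seed):

# Sketch of the k-fold procedure: split into k folds, train on k-1 folds,
# validate on the held-out fold, then average the k scores.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # k = 5; shuffling is optional

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # E_i: accuracy on the i-th fold

E = np.mean(scores)  # E = (1/k) * sum of E_i
print(f"Average accuracy over {len(scores)} folds: {E:.3f}")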
A simple implementation of k-fold Cross-validation using Sklearn
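The code for this example is not embedded here either; a minimal sketch that produces output of this shape, assuming a toy array holding the values 1 to 12 and 4 shuffled folds, could look like this (the exact assignment of points to folds depends on the random seed, so the splits will not necessarily match the output that follows):

# Sketch: 12 data points, 4 folds, shuffled before splitting.
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(1, 13)  # the values 1, 2, ..., 12
kf = KFold(n_splits=4, shuffle=True, random_state=1)

for train_idx, test_idx in kf.split(data):
    print(f"train: {data[train_idx]}, test: {data[test_idx]}")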
Output:
train: [ 2 3 4 5 6 7 8 11 12], test: [1 9 10]
train: [ 1 3 5 6 7 9 10 11 12], test: [2 4 8]
train: [ 1 2 4 5 8 9 10 11 12], test: [3 6 7]
train: [ 1 2 3 4 6 7 8 9 10], test: [5 11 12]

Below is an animation of the Cross-validation process, sourced from Wikipedia.

The choice of k
- First of all, k must be an integer between 2 and n (the number of observations/records). k must be at least 2 to ensure that there are at least two folds.
- For k = 2, we randomly shuffle the data (optional) and split it into 2 sets – d1 and d2 – so that both sets are of (roughly) equal size. The model is then trained on d1 and validated on d2, followed by training on d2 and validation on d1.
- Things can be taken to the extreme by choosing k = n. In this scenario, we have a special case of Cross-validation called Leave-One-Out Cross-Validation (LOOCV). The name is derived from the fact that in each iteration one data point is left out for validation and the other n-1 are used for training.
- In Cross-validation, k is not a fixed parameter, but the following points should be considered when choosing k:
✓ Representativeness heuristic – k should be chosen in such a way that each hold-out fold is representative of the universe (unseen data). For example, if we have 1000 records and we choose k = 500, then at every iteration only 1000/500 = 2 data points are used for validation, which makes for a very small validation set. A very large value of k also means less variance across the training folds, which limits the differences between the models across iterations. This means that k should be neither too large nor too small.
✓ Rule of thumb – Despite the fact that k is not a fixed parameter (no specific formula can be used to determine the best choice), 10 is commonly used because it has been shown experimentally to be a good choice most of the time (see the cross_val_score sketch after this list).
- Given n observations in the original data, each fold will contain n/k records.
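In Sklearn, the rule-of-thumb choice of k can be exercised in one line with cross_val_score; here is a minimal sketch, again assuming the Iris/Logistic Regression setup from earlier:

# Sketch: 10-fold cross-validation (the rule-of-thumb k) with cross_val_score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)  # k = 10
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy:   {scores.mean():.3f}")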
Here is another animation from Wikipedia, demonstrating the concept of Leave-One-Out Cross-Validation:

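For a code-level view of the same idea, here is a minimal sketch of LOOCV using Sklearn's LeaveOneOut, again assuming the Iris/Logistic Regression setup (with n = 150 records this trains 150 models, so it can be slow on larger datasets):

# Sketch: Leave-One-Out Cross-Validation – each iteration holds out one record.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(f"Number of folds (= n): {len(scores)}")
print(f"Mean accuracy: {scores.mean():.3f}")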
Reason for Using Cross-validation Approach
- A model's performance in training, or even in a single run against the validation set, is no guarantee of how well the model will perform on unseen data. This happens when the data is not large enough to be representative of the state of the universe. In this case, the model's error on the test data may not reflect how the model will perform in the universe. This is exactly the case in the above example – using 150 data records to train and test the model yielded inconsistent scores.
- In the absence of large datasets, therefore, cross-validation is the best option for evaluating model performance.
Summary of features of k-Fold Cross-validation
- Shuffling (which is an optional operation) is done before splitting the data into k folds. This means that a given data point is assigned to a single fold and stays in that fold for the rest of the Cross-validation procedure. It also means that a given data point is used once for validation and k-1 times for training.
- There are n/k data points in each fold, where n is the number of observations in the original data and k is the number of folds.
- k can take any integer value between 2 and n (the number of observations in the original sample).
Bootstrap Sampling
Bootstrap sampling is a resampling technique that involves random sampling with replacement. The word resample literally means 'sample again', implying that a bootstrap sample is generated by sampling with replacement from the 'original' sample.
Normally, a sample is generated by selecting a subset of the population for analysis, with the intention of making inferences about the population (population -> sample), whereas resampling is done for the purpose of making inferences about the sample (sample -> resampled data).
Here is a simple example. Assume that a sample of size 12 is taken from a given population and that 3 bootstrap samples are generated from this sample. We can then represent this information as follows:

The original sample of 12 data examples and three bootstrap samples, each also drawn as 12 data points from the original data with replacement. Source: Author.

Since we are sampling with replacement, notice the following from the above example:
- Some data points may appear in more than one set. For example, 6 appears in sets D2 and D3.
- Some data points may appear more than once in a given set. For example, 9, 10, 4, 3 and 7 appear more than once in set D1.
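Here is a minimal NumPy sketch of how such bootstrap samples could be drawn; the particular values will not match the figure above, since the draws are random:

# Sketch: draw bootstrap samples of size n WITH replacement
# from an original sample of n = 12 data points.
import numpy as np

rng = np.random.default_rng(seed=0)
original = np.arange(1, 13)  # the original sample: 1, 2, ..., 12

for j in range(1, 4):  # three bootstrap samples: D1, D2, D3
    D_j = rng.choice(original, size=len(original), replace=True)
    print(f"D{j}: {np.sort(D_j)} (distinct points: {len(np.unique(D_j))})")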
"Assuming n (size of the original data) is sufficiently large, for all practical purposes there is virtually zero probability that it (the bootstrap sample) will be identical to the original ‘real’ sample." – Wikipedia
In fact,
On average, about 63.2% of the original data points appear in any given bootstrap sample – that is the same as saying that an average bootstrap sample omits about 100 - 63.2 = 36.8% of the data in the original sample.
Let's prove this mathematical fact in the most simplified way. Before we do that, let us state some mathematical facts that will help us in the proof:

(1) The natural logarithm and the exponential function are continuous inverses of each other, so for x > 0, x = e^(ln x).
(2) ln(a^b) = b · ln(a).
(3) d/dx ln(u(x)) = u'(x)/u(x) and d/dx (1/x) = -1/x².
We can now continue with our proof.
Suppose the original data has n observations and we draw a bootstrap sample of size n from it. Then:
- the probability of a given observation NOT being chosen in a single draw is (1 - 1/n), and the probability of it being picked is 1/n.
- since we resample n times, the probability of an observation not being chosen in any of the n draws is (1 - 1/n)^n.
From calculus, as n -> ∞ (n grows large), this probability can be determined using the concept of limits. That is to say, if we let L denote that limit, then taking the natural logarithm (and moving it inside the limit, since ln is continuous) and using fact (2):

ln L = ln [ lim_{n→∞} (1 - 1/n)^n ] = lim_{n→∞} n · ln(1 - 1/n) = lim_{n→∞} ln(1 - 1/n) / (1/n)   ... (4)
From here, we need to understand and apply L'Hôpital's rule:

If lim_{x→c} f(x) = lim_{x→c} g(x) = 0 (or ±∞) and lim_{x→c} f'(x)/g'(x) exists, then lim_{x→c} f(x)/g(x) = lim_{x→c} f'(x)/g'(x).
From Equation (4), we can verify that the conditions required for applying L'Hôpital's rule are actually met: as n -> ∞, the numerator ln(1 - 1/n) -> ln(1) = 0 and the denominator 1/n -> 0, so the limit in Equation (4) has the indeterminate form 0/0.
We can then apply L'Hôpital's rule to Equation (4), differentiating the numerator and the denominator with respect to n, and proceed as shown below:

ln L = lim_{n→∞} [ d/dn ln(1 - 1/n) ] / [ d/dn (1/n) ]
     = lim_{n→∞} [ (1/n²) / (1 - 1/n) ] / ( -1/n² )
     = lim_{n→∞} -1 / (1 - 1/n)
     = -1

Therefore L = e^(-1) ≈ 0.368.
This ends our proof, with confirmation that bootstrapping leaves out approximately e^(-1) ≈ 36.8% of the data in the original sample as n grows large.
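As a quick numerical check of this result, the sketch below evaluates (1 - 1/n)^n for increasing n and also measures the fraction of distinct original points in one simulated bootstrap sample:

# Numerical check: (1 - 1/n)^n -> 1/e ≈ 0.368, so ~36.8% of points are omitted
# from an average bootstrap sample and ~63.2% appear in it.
import math
import numpy as np

for n in (10, 100, 1_000, 100_000):
    p_omitted = (1 - 1 / n) ** n
    print(f"n = {n:>6}: omitted ≈ {p_omitted:.4f}, appears ≈ {1 - p_omitted:.4f}")
print(f"limit: omitted = 1/e ≈ {1 / math.e:.4f}")

# Empirical check with one bootstrap sample of size n:
rng = np.random.default_rng(seed=0)
n = 100_000
sample = rng.choice(n, size=n, replace=True)
print(f"fraction of distinct original points: {len(np.unique(sample)) / n:.4f}")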
That is the end of this article for today. If you liked it, please check out the following article on the Cross-Entropy Loss Function and read it as well. See you next time, and happy reading 🙂
Join medium on https://medium.com/@kiprono_65591/membership to get full access to every story on Medium.
You can also get the articles into your email inbox whenever I post using this link: https://medium.com/subscribe/@kiprono_65591