
Increasing Model Reliability: Model Selection – Cross-Validation

Model selection methods to increase the reliability of results, with Python implementations, in one view.

It is essential that a model prepared in Machine Learning gives reliable results on external datasets, that is, that it generalizes. After a part of the dataset is reserved as a test set and the model is trained, the accuracy obtained on the test data may be high while it is very low on external data. For example, if in a dataset with x, y, and z labels only the data with the x label ends up in the randomly selected test set, the resulting score actually reflects the accuracy on the x label alone, not the accuracy of the model, and this may go unnoticed by the developer. A model that fails to generalize in this way is definitely an undesirable case. This article covers different configurations in which training data and test data can be selected to increase the reliability of the model's results. These methods are essential for the model to respond correctly in open-world projects.

Table of Contents 
1. Train Test Split
2. Cross Validation
2.1. KFold Cross Validation
2.2. Stratified KFold Cross Validation
2.3. LeaveOneOut Cross Validation
2.4. Repeated KFold Cross Validation
2.5. ShuffleSplit Cross Validation
2.6. Group KFold Cross Validation
Photo by Siami Tan on Unsplash

1. Train Test Split

In Machine Learning applications, the model is trained with the dataset. Through this training, the model learns the relationship between the features and the output (target). Then, the accuracy of the model is evaluated with another dataset in the same format. train_test_split separates the data into train data and test data at the ratio selected by the user.

IN[1]
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
x_train,x_test,y_train,y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=2021)
print("shape of x_train",x_train.shape)
print("shape of x_test",x_test.shape)
OUT[1]
shape of x_train (120, 4)
shape of x_test (30, 4)

As seen in OUT[1], the dataset is separated into 20% test data and 80% train data.

2. Cross-Validation

Although splitting the dataset with train_test_split seems very useful, the accuracy obtained from the test dataset may not reflect the truth. For example, when we randomly separate a dataset containing A, B, and C labels with train_test_split, the data with A and B labels may end up in the training set and all C-labeled data in the test set. In this case, there is a mismatch between the trained and tested data, and the score becomes misleading about the success of the model. Therefore, different methods are used when separating the dataset into train data and test data. Now let's examine the types of cross-validation, which are based on statistics and easily implemented with the scikit-learn library.

2.1. KFold Cross-Validation

The dataset is divided into the number of parts (k) selected by the user. Each part is called a fold, the model is fit k times, and a different fold is used as the test dataset in each split. For example, if a dataset with 150 samples is split with k=3, the model gives us 3 accuracy values, and a different 1/3 slice (0–50, 50–100, 100–150) is used as the test set for each accuracy. The remaining 2/3 is used for training in each iteration.
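To make the slicing concrete, here is a minimal sketch of the 150-sample, k=3 example above, using a dummy index array (X_demo is a stand-in, not a real dataset):

import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(150).reshape(-1, 1)   # 150 dummy samples
for train_idx, test_idx in KFold(n_splits=3).split(X_demo):
    # without shuffling, each consecutive 1/3 slice is the test set exactly once
    print("test:", test_idx.min(), "-", test_idx.max(), "| train size:", len(train_idx))

This prints the test slices 0–49, 50–99, and 100–149, each with the remaining 100 samples used for training.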

IN[2]
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold
heart_dataset=pd.read_csv('heart.csv')
heart_data   =heart_dataset.drop('output',axis=1)
heart_target =heart_dataset['output']
rb=RobustScaler()
heart_robust=rb.fit_transform(heart_data)
IN[3]
kf = KFold(n_splits=5)
i=1
for train_data,test_data in kf.split(X=heart_data, y=heart_target):
    print('iteration',i)
    print(train_data[:10],"train_length:", len(train_data))
    print(test_data[:10],"test_length:", len(test_data))
    print("**********************************")
    i +=1
OUT[3]
iteration 1
[61 62 63 64 65 66 67 68 69 70] train_length: 242
[0 1 2 3 4 5 6 7 8 9] test_length: 61
**********************************
iteration 2
[0 1 2 3 4 5 6 7 8 9] train_length: 242
[61 62 63 64 65 66 67 68 69 70] test_length: 61
**********************************
iteration 3
[0 1 2 3 4 5 6 7 8 9] train_length: 242
[122 123 124 125 126 127 128 129 130 131] test_length: 61
**********************************
iteration 4
[0 1 2 3 4 5 6 7 8 9] train_length: 243
[183 184 185 186 187 188 189 190 191 192] test_length: 60
**********************************
iteration 5
[0 1 2 3 4 5 6 7 8 9] train_length: 243
[243 244 245 246 247 248 249 250 251 252] test_length: 60
**********************************

The heart disease dataset consists of 14 columns and 303 rows. All feature values are numeric, so RobustScaler is applied. The target of the dataset consists of 0s and 1s. OUT[3] shows the first 10 indices of the test and train data in each split. As can be seen, the test data of the first split starts from the first sample of the dataset; the test data of the second split starts from sample 61; the third from sample 122; the fourth from sample 183; and the final split from sample 243.

IN[4]
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
scores=cross_val_score(LogisticRegression(),heart_robust,heart_target,cv=5)
print(scores)
print("mean accuracy:",scores.mean())
OUT[4]
[0.83606557 0.86885246 0.83606557 0.86666667 0.76666667]
mean accuracy: 0.8348633879781422

The heart disease dataset of 303 rows has been split 5 times (k=5), and logistic regression is fit in each split. The Logistic Regression accuracies for the splits are [0.83606557 0.86885246 0.83606557 0.86666667 0.76666667], respectively.

KFold Cross-Validation with Shuffle

In k-fold cross-validation, the dataset is divided into k parts in order. When the shuffle and random_state values inside KFold are set, the data is selected randomly:

IN[5]
kfs = KFold(n_splits=5, shuffle=True, random_state=2021)
scores_shuffle=cross_val_score(LogisticRegression(),heart_robust,heart_target,cv=kfs)
print(scores_shuffle)
print("mean accuracy:",scores_shuffle.mean())
OUT[5]
[0.83606557 0.78688525 0.78688525 0.85       0.83333333]
mean accuracy: 0.8186338797814209

2.2. Stratified KFold Cross-Validation

The dataset is divided into the user-selected number of parts (k). Unlike KFold, each target class is also split into k parts, so every fold contains the classes in the same proportion as the full dataset. For example, consider the iris dataset (the first 50 samples are Iris setosa; 50–100 Iris versicolor; 100–150 Iris virginica) split with a k value of 5:

IN[6]
iris_dataset=pd.read_csv('iris.csv')
iris_data   =iris_dataset.drop('Species',axis=1)
iris_data   =iris_data.drop(['Id'],axis=1)
iris_target =iris_dataset['Species']
IN[7]
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
i=1
for train_data,test_data in skf.split(X=iris_data, y=iris_target):
    print('iteration',i)
    print(test_data,"length", len(test_data))
    print("**********************************")
    i +=1
OUT[7]
iteration 1
[  0   1   2   3   4   5   6   7   8   9  50  51  52  53  54  55  56  57 58  59 100 101 102 103 104 105 106 107 108 109] length 30
**********************************
iteration 2
[ 10  11  12  13  14  15  16  17  18  19  60  61  62  63  64  65  66  67 68  69 110 111 112 113 114 115 116 117 118 119] length 30
**********************************
iteration 3
[ 20  21  22  23  24  25  26  27  28  29  70  71  72  73  74  75  76  77 78  79 120 121 122 123 124 125 126 127 128 129] length 30
**********************************
iteration 4
[ 30  31  32  33  34  35  36  37  38  39  80  81  82  83  84  85  86  87 88  89 130 131 132 133 134 135 136 137 138 139] length 30
**********************************
iteration 5
[ 40  41  42  43  44  45  46  47  48  49  90  91  92  93  94  95  96  97 98  99 140 141 142 143 144 145 146 147 148 149] length 30
**********************************

When the test sets are analyzed, the 1st iteration contains the first 10 samples of each class (0–9; 50–59; 100–109), the 2nd iteration contains the second 1/5 slice of each class (10–19; 60–69; 110–119), and so on. The remaining data in each iteration is used for training; logistic regression is fit on the training data of each iteration, and naturally, 5 different accuracies are obtained.

IN[8]
from sklearn.preprocessing import LabelEncoder
lr=LogisticRegression()
le=LabelEncoder()
iris_labels=le.fit_transform(iris_target)
rb=RobustScaler()
iris_robust=rb.fit_transform(iris_data)
iris_robust=pd.DataFrame(iris_robust)
IN[9]
scores_skf = []
i = 1
for train_set, test_set in skf.split(X=iris_robust, y=iris_labels):
    lr.fit(iris_robust.loc[train_set], iris_labels[train_set])
    sco = lr.score(iris_robust.loc[test_set], iris_labels[test_set])
    scores_skf.append(sco)
    i += 1
print(scores_skf)
print("mean accuracy:",sum(scores_skf) / len(scores_skf))
OUT[9]
[0.9, 0.9666666666666667, 0.9333333333333333, 0.9333333333333333, 0.9666666666666667]
mean accuracy: 0.9400000000000001

Because a NumPy array has no loc attribute ("'numpy.ndarray' object has no attribute 'loc'"), the NumPy array iris_robust is converted to a pandas DataFrame.

Stratified KFold Cross-Validation can also be implemented easily with cross_val_score in scikit-learn, as seen below.

IN[10]
score_skf=cross_val_score(lr,iris_robust,iris_labels,cv=skf)
print(score_skf)
print("mean accuracy:",score_skf.mean())
OUT[10]
[0.9        0.96666667 0.93333333 0.93333333 0.96666667]
mean accuracy: 0.9400000000000001

Both results are the same.

Since the classes in the iris dataset are balanced (50–50–50), each fold receives an equal share of each class. However, if the labels in the dataset were in different proportions, each fold would contain data at those proportions. For example, if there were 100 samples of label-x and 900 samples of label-y, each fold would contain 90% label-y data and 10% label-x data.
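A minimal sketch of this behavior with a synthetic imbalanced target (the hypothetical 100 label-x / 900 label-y case above; X_demo and y_demo are dummies):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X_demo = np.zeros((1000, 1))              # dummy features
y_demo = np.array([0]*100 + [1]*900)      # 10% label-x, 90% label-y
for _, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # every test fold keeps the 10%/90% class ratio of the full dataset
    print(np.bincount(y_demo[test_idx]))  # [ 20 180] in each fold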

Looking at the split method in the scikit-learn documentation, it can be seen that y must have the shape (n_samples,), which is why it is designed to run with a LabelEncoder rather than a OneHotEncoder.
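A short sketch of that shape difference, with a few hypothetical string labels:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(['setosa', 'versicolor', 'virginica', 'setosa'])
print(LabelEncoder().fit_transform(labels).shape)       # (4,)  -> valid as y
onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1))
print(onehot.shape)                                     # (4, 3) -> 2D, rejected by split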

2.3. LeaveOneOut Cross-Validation

Each sample is treated as a fold, so the value of k is equal to the number of samples. Each sample is separated one by one, the model is trained with the remaining samples, and the separated sample is tested with the trained model. If we consider the iris dataset:

IN[11]
from sklearn.model_selection import LeaveOneOut
loo = cross_val_score(estimator=LogisticRegression(), X=iris_robust, y=iris_labels,
                               scoring='accuracy', cv=LeaveOneOut())
print(loo,"len of loo=",len(loo))
print("mean accuracy:",loo.mean())
OUT[11]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] len of loo= 150
mean accuracy: 0.9466666666666667

As can be seen, the iris dataset consisting of 150 samples is modeled with LeaveOneOut and logistic regression. Each sample is held out in turn and the model is trained with the remaining samples. In total, the model is trained 150 times and 150 predictions (one per sample) are made, all of which are averaged.

Photo by Brian Suman on Unsplash

2.4. Repeated KFold Cross-Validation

KFold cross-validation is repeated as many times as the value selected by the user (n_repeats). For the iris dataset:

IN[12]
from sklearn.model_selection import RepeatedKFold
rkf = cross_val_score(estimator=LogisticRegression(), X=iris_robust, y=iris_labels, scoring='accuracy', cv=RepeatedKFold(n_splits=5, n_repeats=5))
print("accuracy:", rkf)
print("mean accuracy",rkf.mean())
OUT[12]
accuracy: 
[0.96666667 0.9       0.93333333 0.93333333 0.93333333 0.9
 0.93333333 1.         0.96666667 0.96666667 0.86666667 0.96666667
 0.96666667 0.96666667 1.         0.96666667 0.9        1.
 0.93333333 0.93333333 0.86666667 0.96666667 1.         0.9
 0.96666667]
mean accuracy 0.9453333333333331

The dataset was divided into 5 folds and the whole procedure was repeated 5 times. As a result, 25 accuracy values were obtained.

The same can be done for Stratified KFold:

IN[13]
from sklearn.model_selection import RepeatedStratifiedKFold
rskf = cross_val_score(estimator=LogisticRegression(), X=iris_robust, y=iris_labels, scoring='accuracy', cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=5))
print("accuracy", rskf)
print("mean accuracy",rskf.mean())
OUT[13]
accuracy
[0.96666667 0.9        0.96666667 0.96666667 0.96666667 0.9
 0.96666667 0.96666667 0.9        0.96666667 0.96666667 0.96666667
 0.9        0.96666667 0.96666667 0.9        0.93333333 1.
 0.96666667 0.96666667 0.96666667 0.93333333 1.         0.9
 0.93333333]
mean accuracy 0.9493333333333333

2.5. Shuffle Split Cross-Validation

The number of iterations is set with n_splits, and the train and test data are randomly selected for each split at user-specified ratios:

IN[14]
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(test_size=.4, train_size=.5, n_splits=10)
scores_ss = cross_val_score(LogisticRegression(), iris_robust, iris_labels, cv=shuffle_split)
print("Accuracy",scores_ss)
print("mean accuracy:",scores_ss.mean())
OUT[14]
Accuracy
[0.9        0.93333333 0.88333333 0.9        0.95       0.95
 0.93333333 0.91666667 0.95       0.95      ]
mean accuracy: 0.9266666666666665

The same procedure can be done for Stratified Shuffle Split:

IN[15]
from sklearn.model_selection import StratifiedShuffleSplit
shuffle_sfs=StratifiedShuffleSplit(test_size=.4, train_size=.5, n_splits=10)
scores_sfs = cross_val_score(LogisticRegression(), iris_robust, iris_labels, cv=shuffle_sfs)
print("Accuracy",scores_sfs)
print("mean accuracy:",scores_sfs.mean())
OUT[15]
Accuracy
[0.88333333 0.93333333 0.93333333 0.93333333 0.91666667 0.96666667
 0.96666667 0.96666667 0.88333333 0.9       ]
mean accuracy: 0.9283333333333333

2.6. Group KFold Cross-Validation

It is used when more than one sample is collected from the same subject. For example, with medical data that contains multiple images from the same patient, the model should be evaluated on patients it has never seen, so all images belonging to one patient must stay together for the model to generalize. To achieve this, GroupKFold can be used, which takes as an argument an array of groups indicating which patient each image belongs to. The groups array specifies groups in the data that should not be split when creating training and test sets, and it should not be confused with the class label. With GroupKFold, a group is either entirely in the training set or entirely in the test set.
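A minimal sketch of GroupKFold, assuming a hypothetical patients array that marks which patient each image belongs to (X_demo and y_demo are dummies):

import numpy as np
from sklearn.model_selection import GroupKFold

X_demo = np.random.rand(12, 5)            # 12 dummy images
y_demo = np.random.randint(0, 2, 12)      # dummy binary labels
patients = np.repeat([1, 2, 3, 4], 3)     # 3 images per patient

for train_idx, test_idx in GroupKFold(n_splits=4).split(X_demo, y_demo, groups=patients):
    # all images of a patient are either entirely in train or entirely in test
    print("test patient:", np.unique(patients[test_idx]))

With 4 patients and n_splits=4, each patient's images form the test set exactly once.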
