
How to train_test_split: KFold vs StratifiedKFold

Explained with examples

Photo by elizabeth lies on Unsplash

The data used in supervised learning tasks contains features and a label for a set of observations. The algorithms try to model the relationship between the features (independent variables) and the label (dependent variable). We first train the model by providing both the features and the label for some observations. Then we test the model by providing only the features and expecting it to predict the labels. Thus, we need to split the data into training and test subsets. We let the model learn on the training set and then measure its performance on the test set.

The scikit-learn library provides many tools for splitting data into training and test sets. The most basic one is train_test_split, which simply divides the data into two parts according to the specified partitioning ratio. For instance, train_test_split(test_size=0.2) will set aside 20% of the data for testing and 80% for training. Let’s see how it works with an example. We will create a sample dataframe with one feature and a label:

import pandas as pd
import numpy as np
# 16 observations: the first 12 belong to class 1, the last 4 to class 0
target = np.ones(16)
target[-4:] = 0
df = pd.DataFrame({'col_a': np.random.random(16),
                   'target': target})
df

Then we apply the train_test_split function:

from sklearn.model_selection import train_test_split
X = df.col_a
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print("TRAIN:", X_train.index, "TEST:", X_test.index)

The first 80% is the training set and the last 20% is the test set. If we set the shuffle parameter to True, the data will be split randomly:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
print("TRAIN:", X_train.index, "TEST:", X_test.index)

The default value of shuffle is True, so the data will be split randomly if we do not specify the shuffle parameter. If we want the splits to be reproducible, we also need to pass an integer to the random_state parameter. Otherwise, each time we run train_test_split, different indices will be assigned to the training and test sets. Please note that the numbers seen in the outputs are the indices of the data points, not the actual values.
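As a minimal sketch of reproducibility, two calls with the same random_state (the value 42 below is arbitrary) produce identical splits:

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)
# Both calls select the same test indices
print((X_test1.index == X_test2.index).all())  # True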

Data is a valuable asset, and we want to make use of every bit of it. If we split the data using train_test_split, we can only train a model on the portion set aside for training, and models generally get better as the amount of training data increases. One solution to this issue is cross validation. With cross validation, the dataset is divided into n splits: n-1 splits are used for training, and the remaining split is used for testing. The model runs through the entire dataset n times, and each time a different split is used for testing. Thus, we use all of the data points for both training and testing. Cross validation also measures the performance of a model more reliably, especially on new, previously unseen data points.

There are different methods to split data in cross validation. KFold and StratifiedKFold are commonly used.

KFold

As the name suggests, KFold divides the dataset into k folds. If shuffle is set to False, each consecutive fold is a shifted version of the previous one:

from sklearn.model_selection import KFold
X = df.col_a
y = df.target
kf = KFold(n_splits=4)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

In the first iteration, the test set consists of the first four indices. In each subsequent iteration, KFold shifts the test set to the next four indices. If shuffle is set to True, the splitting will be random:

kf = KFold(n_splits=4, shuffle=True, random_state=1)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

StratifiedKFold

StratifiedKFold takes cross validation one step further: the class distribution in the dataset is preserved in the training and test splits. Let’s take a look at our sample dataframe:

There are 16 data points. 12 of them belong to class 1 and the remaining 4 belong to class 0, so this is an imbalanced class distribution. KFold does not take this into consideration. Therefore, in classification tasks with imbalanced class distributions, we should prefer StratifiedKFold over KFold.
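As a quick sanity check, we can count the classes in each KFold test set; with shuffle=False, the last fold’s test set consists entirely of class 0:

kf = KFold(n_splits=4)  # no shuffling
for train_index, test_index in kf.split(X):
    # count how many class 0 and class 1 samples land in the test fold
    counts = np.bincount(y.iloc[test_index].astype(int), minlength=2)
    print("TEST class counts (0, 1):", counts)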

The ratio of class 0 to class 1 is 1/3. If we set k=4, then each test set includes three data points from class 1 and one data point from class 0, and accordingly each training set includes three data points from class 0 and nine data points from class 1.

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)

The indices of class 0 are 12, 13, 14, and 15. As we can see, the class distribution of the dataset is preserved in the splits. We can also use shuffling with StratifiedKFold:

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)

Another splitting method is called "leave one out", which uses only one data point for testing and the remaining data points for training. Scikit-learn has a LeaveOneOut class to perform this type of partitioning.
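As a minimal sketch, LeaveOneOut can be used just like KFold; on our dataframe of 16 data points it produces 16 splits, each holding out a single observation:

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 16 splits, one per data point
for train_index, test_index in loo.split(X):
    print("TRAIN size:", len(train_index), "TEST:", test_index)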


Finally, I would like to mention another important tool provided by scikit-learn: cross_val_score.

cross_val_score takes the dataset and an estimator (e.g. logistic regression, decision tree, …), applies cross validation to split the data, trains the model on each training split, and measures its performance on the corresponding test split using the metric given by the scoring parameter.
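As an illustration, here is a minimal sketch with a logistic regression on our sample data; the choice of estimator and the accuracy scoring below are just examples, and the single feature column has to be reshaped into a 2D array:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression()
# cross_val_score expects a 2D feature array, hence the reshape
scores = cross_val_score(model, X.to_numpy().reshape(-1, 1), y,
                         cv=4, scoring='accuracy')
print(scores)         # one score per fold
print(scores.mean())  # average performance across the folds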


Thank you for reading. Please let me know if you have any feedback.

