
Finding the optimal tuning parameters for a machine learning problem can often be very difficult. We may encounter overfitting, which means our model learns the training dataset too closely and produces higher error when applied to our test/holdout datasets. Or we may run into underfitting, which means our model fails to capture enough of the structure in the training dataset, which also leads to higher error on test/holdout datasets.
When conducting a normal train/validation/test split for model training and testing, the model trains on a specific randomly selected portion of the data, validates on a separate set of data, then finally tests on a holdout dataset. In practice this can cause issues, especially when the dataset is relatively small, because you could be withholding observations that are key to training an optimal model. Keeping a percentage of the data out of the training phase, even if it's only 15–25%, removes information that would otherwise help our model train more effectively.
In comes a solution to our problem: cross validation. Cross validation works by splitting our dataset into random groups, holding one group out as the test set, and training the model on the remaining groups. This process is repeated with each group taking a turn as the held-out test set, and the scores from each run are averaged to evaluate the resulting model.
One of the most common types of cross validation is k-fold cross validation, where ‘k’ is the number of folds within the dataset. Using k = 5 is a common starting point and makes the principle easy to demonstrate below:

Here we see five iterations of the model, each of which treats a different fold as the test set and trains on the other four folds. Once all five iterations are complete, their scores are averaged together to give the final cross validation result.
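As a minimal sketch of what this looks like in code, scikit-learn's cross_val_score runs the k-fold loop for you; the snippet below assumes a feature matrix X and target y like the ones used later in this article:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Illustrative sketch only: assumes X (features) and y (target) are already prepared
model = RandomForestClassifier(random_state = 18)
scores = cross_val_score(model, X, y, cv = 5)  # one score per fold
print(scores.mean())  # the averaged cross validation score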
While cross validation can greatly benefit model development, there is also an important drawback to consider. Because each of the k iterations requires training the full model, cross validation gets computationally expensive as your dataset grows and as the value of ‘k’ increases. For example, running cross validation with k = 10 on a dataset of 1 million observations requires training 10 separate models, each on 900,000 of the observations. This won't really be an issue with small datasets, where the compute time is on the scale of minutes, but when working with larger datasets sized in many gigabytes or terabytes, the time required will significantly increase.
For the remainder of this article we will look to implement cross validation on the random forest model created in my prior article linked here. Additionally, we will implement what is known as grid search, which allows us to run the model over a grid of hyperparameters in order to identify the optimal result.
Data: For this article, I will continue to use the Titanic survivor data posted to Kaggle by Syed Hamza Ali located here; this data is licensed CC0 – Public Domain. This dataset provides information on passengers such as age, ticket class, sex, and a binary variable for whether the passenger survived. This data could also be used to compete in the Kaggle Titanic ML competition, so in the spirit of keeping this competition fair, I won't show all the steps I took to conduct EDA & data wrangling, or directly post the code. I will be building off my prior model developed in the article mentioned above.
As a reminder, the base random forest training model used looked like the following:
# Imports (assumed to carry over from the prior article)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Train/Test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 18)
# Model training
clf = RandomForestClassifier(n_estimators = 500, max_depth = 4, max_features = 3, bootstrap = True, random_state = 18).fit(x_train, y_train)
And the results we achieved were:

For this article, we will keep this train/test split portion so the holdout test data stays consistent between models, but we will use cross validation and grid search for parameter tuning on the training data to see how the resulting output differs from the output of the base model above.
GridSearchCV: The tool we will be utilizing in this article is sklearn's GridSearchCV class, which allows us to pass our specific estimator, our grid of parameters, and our chosen number of cross validation folds. The documentation for this class can be found here. Some of the main parameters are highlighted below:
- estimator – this parameter allows you to select the specific model you’re choosing to run, in our case Random Forest Classification.
- param_grid – this parameter takes the grid of parameters you are searching. The grid must be formatted as a dictionary whose keys match the estimator's parameter names and whose values are lists of values to try for each parameter (see the sketch after this list).
- cv – this parameter allows you to change the number of folds for the cross validation.
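To make the param_grid format concrete, here is a small sketch that uses scikit-learn's ParameterGrid to list the combinations a grid expands into; the toy grid below is illustrative only and is not the grid used later in this article:
from sklearn.model_selection import ParameterGrid
# Illustrative toy grid for demonstration purposes
toy_grid = {'max_depth' : [4, 5], 'criterion' : ['gini', 'entropy']}
for params in ParameterGrid(toy_grid):
    print(params)  # four combinations in total, e.g. {'criterion': 'gini', 'max_depth': 4}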
Model Training: We will first create a grid of parameter values for the random forest classification model. The first parameter in our grid is n_estimators, which sets the number of trees used in our random forest model; here we try values of 200, 300, 400, or 500. Next, we choose the values of the max_features parameter, which limits the number of features considered at each split. We set this parameter to ‘sqrt’ or ‘log2’, which takes the square root, or log base 2, of the number of features in the dataset. The third parameter is max_depth, which sets the maximum tree depth per tree in the random forest model to 4, 5, 6, 7, or 8. Lastly, the criterion parameter will search through ‘gini’ and ‘entropy’ to find the better split criterion. This grid can be seen below:
grid = {
'n_estimators': [200,300,400,500],
'max_features': ['sqrt', 'log2'],
'max_depth' : [4,5,6,7,8],
'criterion' :['gini', 'entropy'],
'random_state' : [18]
}
After creating our grid we can run our GridSearchCV model, passing RandomForestClassifier() to the estimator parameter, our grid to the param_grid parameter, and a cross validation fold value of 5. With 4 × 2 × 5 × 2 = 80 parameter combinations and 5 folds each, this fits 400 models in total.
from sklearn.model_selection import GridSearchCV
rf_cv = GridSearchCV(estimator=RandomForestClassifier(), param_grid=grid, cv=5)
rf_cv.fit(x_train, y_train)
We can now use the model's .best_params_ attribute to output the best parameters found by the search.
rf_cv.best_params_
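Besides best_params_, a couple of other GridSearchCV attributes can be worth inspecting; a short sketch (both attributes belong to the fitted GridSearchCV object):
print(rf_cv.best_score_)  # mean cross validated score of the best parameter combination
best_rf = rf_cv.best_estimator_  # estimator refit on the full training data with the best parameters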

Now that we have our optimal list of parameters, we can run the basic RandomForestClassifier model using these parameters and test our results compared to the results obtained using our original train/test split without grid search.
rf2 = RandomForestClassifier(n_estimators = 200, max_depth = 7, max_features = 'sqrt', random_state = 18, criterion = 'gini').fit(x_train, y_train)
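For reference, a sketch of how the holdout metrics reported below might be computed, assuming scikit-learn's accuracy_score and f1_score:
from sklearn.metrics import accuracy_score, f1_score
# Evaluate the retrained model on the held-out test split from earlier
y_pred = rf2.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred))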

Our tuned model outperforms the initial model, with an accuracy score of 0.883 compared to 0.861 prior, and an F1 score of 0.835 compared to 0.803.
The one drawback experienced while incorporating GridSearchCV was the runtime. As mentioned earlier, cross validation and grid search lead to longer training times given the number of model fits required. The overall GridSearchCV run took about four minutes, which may not seem like much, but take into consideration that we only had around 1k observations in this dataset. How long do you think it would've taken with 100k observations, or millions of observations?
Conclusion: By using cross validation and grid search we were able to reach a better result than our original train/test split with minimal tuning. Cross validation is a very important method for creating better-fitting models, because it trains and validates on every part of the training dataset.
Thank you for taking the time to read this article! I hope you enjoyed reading and have learned more about how to apply cross validation & grid search to your Machine Learning models. If you enjoyed what you read, please follow my profile to be among the first to see future articles!