Tuning the Hyperparameters of your Machine Learning Model using GridSearchCV

Learn how to use the GridSearchCV function in sklearn to optimize your machine learning model

Photo by Roberta Sorge on Unsplash

Two of the key challenges in machine learning are finding the right algorithm to use and optimizing your model. If you are familiar with machine learning, you may have worked with algorithms like Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines, etc. Once you have decided on using a particular algorithm for your machine learning model, the next challenge is how to fine-tune the hyperparameters of your model so that your model works well with the dataset you have. In this article, I want to focus on the latter part – fine-tuning the hyperparameters of your model. As complex as the term may sound, fine-tuning your hyperparameters can actually be done quite easily using the GridSearchCV class in the sklearn module.

Performing Classification using Logistic Regression

Before you learn how to fine-tune the hyperparameters of your machine learning model, let’s try to build a model using the classic Breast Cancer dataset that ships with sklearn. Since this is a classification problem, we shall use Logistic Regression as an example.

For classification problems, you can also use other algorithms like Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naive Bayes, and more. But for this article, I will use Logistic Regression.

First, let’s load the dataset and load it into a Pandas DataFrame:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# load the Breast Cancer dataset into a Pandas DataFrame
bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns = bc.feature_names)
df['diagnosis'] = bc.target   # 0 - malignant, 1 - benign
df

The first 30 columns are the various features and the last column is the diagnosis (0 for malignant and 1 for benign). For simplicity, I will use all 30 feature columns for training and the last column as the target.

Ideally, you should perform feature selection to filter out columns that exhibit collinearity, as well as columns that do not have a strong correlation with the target.
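
For instance, a quick way to spot such columns is to inspect the DataFrame’s correlation matrix. Here is a minimal sketch (the 0.9 threshold is just an illustrative choice, not a rule):

corr = df.corr()

# correlation of each feature with the target
print(corr['diagnosis'].sort_values())

# pairs of features that are highly correlated with each other
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)
print([(a, b) for a in corr.columns for b in corr.columns
       if a < b and high.loc[a, b]])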

Let’s extract the values for the features and the label and save them as arrays:

dfX = df.iloc[:,:-1]   # Features - 30 columns
dfy = df['diagnosis']  # Label - last column
X = dfX.values
y = dfy.values
print(X.shape)   # (569, 30); 2D array
print(y.shape)   # (569,);    1D array

Split the dataset into a training set and testing set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                       test_size=0.25, 
                                       random_state=2)

Next, standardize the training and testing datasets:

from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training set
X_test = scaler.transform(X_test)        # apply the same scaling to the testing set

The StandardScaler class rescales data to have a mean of 0 and a standard deviation of 1 (unit variance). Note that the scaler is fitted on the training set only; the same transformation is then applied to the testing set so that no information from the testing set leaks into the training process.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance). Source: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
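
As a quick sanity check (a minimal sketch), you can verify that each column of the scaled training set now has a mean of approximately 0 and a standard deviation of approximately 1:

print(X_train.mean(axis=0).round(6))   # all values close to 0
print(X_train.std(axis=0).round(6))    # all values close to 1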

Finally, use the LogisticRegression class from sklearn to build a model using the training set, and then obtain the predictions for all the items in the testing set:

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
y_pred = logreg.predict(X_test)

To see how well your model is performing, obtain its accuracy:

from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
# OR
print("Accuracy:",logreg.score(X_test, y_test))

You can get the accuracy using either the score() function of the model that you have just built, or use the accuracy_score() function from the metrics module.

The accuracy for the above code snippet is:

Accuracy: 0.9790209790209791

Understanding Cross Validation

To understand how GridSearchCV can optimize the model you built in the previous section, you first need to understand what cross validation is. Remember that in the previous section we divided the dataset into a training set and a testing set?

Image by author

The testing set is used to evaluate the performance of the model that you have trained using the training set. While this is a good way to evaluate the model, it might not give you a true indication of the model’s performance. For all you know, the data in the testing set may be skewed, and using it to evaluate the model may give a very biased result. A much better way is to divide the entire dataset into k folds (or k parts; for example, k = 10 means dividing the dataset into 10 equal parts). Out of the k folds, use 1 fold for testing and the remaining k-1 folds for training:

Image by author

In each iteration, record the metrics (such as accuracy, precision, etc.) and, at the end of all the iterations, calculate the mean of these metrics. This gives your model a good mixture of the data for training and testing, and provides a better benchmark of your model’s performance. This process of splitting your data into k folds, using 1 fold for testing and the remaining k-1 folds for training, is known as k-fold cross validation.
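
In fact, sklearn can perform this entire procedure for you with the cross_val_score() function. Here is a minimal sketch using the model and training data from earlier (the choice of 5 folds is just illustrative):

from sklearn.model_selection import cross_val_score

# each element of scores is the accuracy on one held-out fold
scores = cross_val_score(LogisticRegression(), X_train, y_train,
                         cv=5, scoring='accuracy')
print(scores)
print("Mean accuracy:", scores.mean())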

Using GridSearchCV for hyperparameter tuning

In our earlier example, we created an instance of the LogisticRegression class without passing any arguments to its initializer. Instead, we relied on the default values of its various parameters, such as:

  • penalty – Specify the norm of the penalty.
  • C – Inverse of regularization strength; smaller values specify stronger regularization.
  • solver – Algorithm to use in the optimization problem.
  • max_iter – Maximum number of iterations taken for the solvers to converge.
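
If you are curious what these defaults actually are, you can inspect them with the model’s get_params() method (a quick sketch):

defaults = LogisticRegression().get_params()
for name in ('penalty', 'C', 'solver', 'max_iter'):
    print(name, '=', defaults[name])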

While it is alright in some cases to rely on the default values of these parameters (known as hyperparameters in machine learning), it is always good to be able to fine-tune their values so that the algorithm works best for the type of data you have. Unfortunately, finding the ideal combination of hyperparameters for your data is not a trivial task. This is where GridSearchCV comes in.

GridSearchCV is a class in sklearn‘s model_selection module. It allows you to specify the different values for each hyperparameter and tries out all the possible combinations when fitting your model. It does the training and testing using cross validation of your dataset – hence the "CV" in GridSearchCV. The end result of GridSearchCV is the set of hyperparameters that best fits your data, according to the scoring metric that you want your model to optimize on.

Let’s first create the parameter grid, which is a dictionary mapping each hyperparameter to the values that you want to try when fitting your model:

from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
# parameter grid
parameters = {
    'penalty' : ['l1','l2'], 
    'C'       : np.logspace(-3,3,7),
    'solver'  : ['newton-cg', 'lbfgs', 'liblinear'],
}
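
Before running the search, it can be useful to know how big it is. With 2 penalties × 7 values of C × 3 solvers, this grid contains 42 combinations, and with the 10-fold cross validation used below, that means 420 model fits. You can count the combinations with ParameterGrid (a small sketch):

from sklearn.model_selection import ParameterGrid
print(len(ParameterGrid(parameters)))   # 42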

Note that I have turned off the warnings, as the grid search tends to generate quite a few of them – not every solver supports every penalty (for example, newton-cg and lbfgs do not support the l1 penalty), so some combinations in the grid fail to fit.

You can then create a GridSearchCV object, passing in the model that you are using together with the various arguments as shown below:

logreg = LogisticRegression()
clf = GridSearchCV(logreg,                    # model
                   param_grid = parameters,   # hyperparameters
                   scoring='accuracy',        # metric for scoring
                   cv=10)                     # number of folds

The GridSearchCV() call returns a GridSearchCV object that wraps your model (in this example, a LogisticRegression instance), which you can then train using your training set:

clf.fit(X_train,y_train)

Once you are done with the training, you can print out the tuned hyperparameters as well as the best cross-validation accuracy:

print("Tuned Hyperparameters :", clf.best_params_)
print("Accuracy :",clf.best_score_)

Here is the result that I have obtained from running the above code snippet:

Tuned Hyperparameters : {'C': 0.1, 
                         'penalty': 'l2', 
                         'solver': 'liblinear'}
Accuracy : 0.983499446290144
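
Beyond best_params_ and best_score_, the fitted GridSearchCV object also records the results of every combination it tried in its cv_results_ attribute, which is convenient to examine as a DataFrame (a brief sketch):

# mean cross-validation accuracy of each hyperparameter combination
results = pd.DataFrame(clf.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())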

The accuracy of 0.9835 is now better than the earlier accuracy of 0.9790.

With the values of the hyperparameters returned by GridSearchCV, you can now build your model using the training dataset:

logreg = LogisticRegression(C = 0.1, 
                            penalty = 'l2', 
                            solver = 'liblinear')
logreg.fit(X_train,y_train)
y_pred = logreg.predict(X_test)
print("Accuracy:",logreg.score(X_test, y_test))

The following figure summarizes what we have done:

Image by author

Observe that:

  • GridSearchCV performs cross validation on the training set, internally splitting it into training and validation folds.
  • Once GridSearchCV has found the best values for the hyperparameters, we use these tuned values to build a new model using the training set.
  • With the Testing set, we can now evaluate our new model.

The approach taken here gives us a reliable metric with which to measure the performance of our new model.

Another way of using GridSearchCV is to fit it using the entire dataset, like the following:

parameters = {
    'penalty' : ['l1','l2'], 
    'C'       : np.logspace(-3,3,7),
    'solver'  : ['newton-cg', 'lbfgs', 'liblinear'],
}
logreg = LogisticRegression()
clf = GridSearchCV(logreg, 
                   param_grid = parameters,
                   scoring = 'accuracy', 
                   cv = 10)
clf.fit(X,y)
print("Tuned Hyperparameters :", clf.best_params_)
print("Accuracy :", clf.best_score_)

The above code snippet prints the following result:

Tuned Hyperparameters : {'C': 1000.0, 
                         'penalty': 'l1', 
                         'solver': 'liblinear'}
Accuracy : 0.9701754385964911

The following figure summarizes what we have just done:

Image by author

In this case, you let GridSearchCV use the entire dataset to derive the tuned parameters, and then use the newly acquired values to build a new model.

The approach taken here finds the best hyperparameters for your dataset, but does not allow you to accurately evaluate the resulting model.

You can’t really evaluate the model built using this approach, because there is no data left that the model has not already seen: any data you could use for prediction was also used for training, and evaluating a model on data it has seen during training would not give you an accurate measure of its performance.

Summary

Using GridSearchCV can save you quite a bit of effort in optimizing your machine learning model. But do note that GridSearchCV will only evaluate the hyperparameter values that you supply in the parameter grid. You may be tempted to specify every possible value for each hyperparameter, but doing so is computationally expensive, as every combination will be evaluated. Finally, do not take GridSearchCV as a panacea for optimizing your model – take the time to perform proper feature selection before you train your model.

Here are links to my previous articles on feature selection:

Statistics in Python – Using ANOVA for Feature Selection

Statistics in Python – Using Chi-Square for Feature Selection

Statistics in Python – Collinearity and Multicollinearity

Statistics in Python – Understanding Variance, Covariance, and Correlation

