Available hyperparameter-optimization techniques

We apply three different hyperparameter-optimization techniques to a set of classification models and compare the resulting accuracy.

Gabriel Naya
Towards Data Science



Background

I have often come across articles about automated hyperparameter-optimization searches. Once I understood the intent and scope of these techniques, I wondered how much a model could really be improved through this kind of optimization.
The aim of this article is therefore to review the available optimization techniques, test them on a simple example, compare them, and get an overview of the improvements obtained.

Hyperparameters are adjustable model parameters that must be tuned to obtain optimal performance. Optimizing a model’s hyperparameters is therefore a crucial task for increasing the performance of the selected algorithm.

We need to know, to some extent, what each hyperparameter means in each algorithm and what its possible values are. Understanding each of these hyperparameters in depth across the different algorithms is necessary but also a huge task, one that sometimes requires knowing how the algorithm works internally and the mathematics behind it. This article does not reach that depth, although we will analyze several algorithms by selecting a few hyperparameters in each of them.
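For example, scikit-learn estimators expose all of their tunable hyperparameters and default values through get_params(); a quick way to inspect them (a minimal sketch, using Random Forest only as an illustration) is:

from sklearn.ensemble import RandomForestClassifier

# Print every tunable hyperparameter of the estimator together with its default value
rf = RandomForestClassifier()
for name, value in sorted(rf.get_params().items()):
    print(name, '=', value)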

Anyone who has used any of these algorithms has probably already made some manual tuning attempts starting from the default values. This manual adjustment usually takes a long time, is rarely done rigorously, and makes it difficult to systematize the results.
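As an illustration of why this is hard to systematize, a manual search over a single hyperparameter might look like the sketch below (n_neighbors of KNN is used only as an example; X_train, y_train, X_valid and y_valid are the splits we build later in the article):

from sklearn.neighbors import KNeighborsClassifier

# Hand-rolled tuning of one hyperparameter: easy to write, slow to run and hard to track
for k in [3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_valid, y_valid))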

Secondly, we can apply readily available and fairly simple automated techniques such as Grid Search and Random Search, which usually give better results but at a high cost in time and computation. We will apply both techniques and compare their results.

Finally, we will apply Bayesian optimization, a method for finding the minimum of a function, using Python’s hyperopt library on the best of the tested algorithms. This technique is not as easy to implement, but it can beat the previous ones in performance or in time.
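To make the “minimum of a function” idea concrete, here is a minimal, self-contained sketch (a toy quadratic, not part of the Titanic exercise) of what a hyperopt fmin call looks like:

from hyperopt import fmin, tpe, hp

# Minimize f(x) = (x - 1)^2 over the interval [-5, 5] with the TPE algorithm
best = fmin(fn=lambda x: (x - 1) ** 2,
            space=hp.uniform('x', -5, 5),
            algo=tpe.suggest,
            max_evals=100)
print(best)  # a dictionary with 'x' close to the true minimum at x = 1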

The dataset

The dataset we will use for the exercise is Kaggle’s Titanic passenger dataset, a binary classification problem: predicting which passengers survived and which did not.

import os
import numpy as np # linear algebra
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import lightgbm as lgb
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
%matplotlib inline
np.random.seed(7)
train = pd.read_csv('./data/train.csv', index_col=0)
y = train.Survived #.reset_index(drop=True)
features = train.drop(['Survived'], axis=1)
features.head()

The selected metric is accuracy¹: the percentage of passengers correctly predicted.
Accuracy = (TP+TN) / (Total)
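
As a sanity check, the same number can be obtained either from the confusion matrix or from accuracy_score (a small sketch, assuming model is any fitted classifier and X_valid, y_valid are the validation split created below):

from sklearn.metrics import accuracy_score, confusion_matrix

# Accuracy computed from the confusion matrix and with the built-in helper: both match
y_pred = model.predict(X_valid)
tn, fp, fn, tp = confusion_matrix(y_valid, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))
print(accuracy_score(y_valid, y_pred))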

Feature engineering and results data frame

We are going to apply basic feature-engineering techniques that allow us to run the algorithms without errors (remember that the objective of this article is not to obtain a good result, only to compare performance when applying the optimization techniques).

We select eight algorithms to perform the work; this decision is based on reusing part of the work of Jason Brownlee² and Will Koehrsen³, who developed much of the code used here.

Each of these models will be a different column in our results table:

1. Stochastic Gradient Descent (SGD)
2. Ridge Classifier
3. K-Nearest Neighbors (KNN)
4. Support Vector Machine (SVM)
5. Bagged Decision Trees
6. Random Forest
7. Logistic Regression
8. LightGBM (LGBM)

For each of these columns, we will try to apply the following optimization techniques:

  • Default hyperparameters
  • Sklearn GridSearchCV
  • Sklearn RandomizedSearchCV
  • Hyperopt for Python

The corresponding feature-engineering code and the empty results table:

features = features.drop(['Cabin'], axis=1)
features = features.drop(['Name'], axis=1)
objects = [col for col in features.columns if features[col].dtype == "object"]
features.update(features[objects].fillna('None'))
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerics = [col for col in features.columns if features[col].dtype in numeric_dtypes]
features.update(features[numerics].fillna(0))
features.info()
# dummies and split
X = pd.get_dummies(features)
X_train, X_valid, y_train, y_valid = train_test_split(X,y,train_size=0.70,test_size=0.30,random_state=0)
#Results dataframe
cols = ['Case','SGD','Ridge','KNN','SVM','Bagging','RndForest','LogReg','LGB']
resul = pd.DataFrame(columns=cols)
resul.set_index("Case",inplace=True)
resul.loc['Standard'] = [0,0,0,0,0,0,0,0]
resul.loc['GridSearch'] = [0,0,0,0,0,0,0,0]
resul.loc['RandomSearch'] = [0,0,0,0,0,0,0,0]
resul.loc['Hyperopt'] = [0,0,0,0,0,0,0,0]
resul.head()
Empty dataframe for results

Row 1 — Default hyperparameters of each algorithm

As mentioned above, the first row of our results table is the starting point of the analysis, taking the default values for the hyperparameters of each of the algorithms:

#Models creation
sgd = SGDClassifier()
ridge = RidgeClassifier()
knn = KNeighborsClassifier()
svc = SVC(gamma='auto')
bag = BaggingClassifier()
rf = RandomForestClassifier(n_estimators=10)
lr = LogisticRegression(solver='liblinear')
lgg = lgb.LGBMClassifier()
models = [sgd, ridge, knn, svc, bag, rf, lr, lgg]
col = 0
for model in models:
    model.fit(X_train, y_train.values.ravel())
    resul.iloc[0, col] = model.score(X_valid, y_valid)
    col += 1
resul.head()

Row 2 — Applying GridSearchCV

“Grid search works by trying every possible combination of parameters you want to try in your model. Those parameters are each tried in a series of cross-validation passes. This technique has been in vogue for the past several years as a way to tune your models” ⁴.

Now we must define, for each algorithm, a dictionary with the hyperparameters to explore and their candidate values:

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
#SGD
loss = ['hinge', 'modified_huber', 'log']
penalty = ['l1','l2']
alpha= [0.0001,0.001,0.01,0.1]
l1_ratio= [0.15,0.05,.025]
max_iter = [1,5,10,100,1000,10000]
sgd_grid = dict(loss=loss,penalty=penalty,max_iter=max_iter,alpha=alpha,l1_ratio=l1_ratio)
#Ridge
alpha = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ridge_grid = dict(alpha=alpha)
#K-Nearest Neighbors
n_neighbors = range(1, 21, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
knn_grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
#Support Vector Classifier
kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]
gamma = ['scale']
svc_grid = dict(kernel=kernel,C=C,gamma=gamma)
#Bagging Classifier
n_estimators = [10, 100, 1000]
bag_grid = dict(n_estimators=n_estimators)
#Random Forest
n_estimators = [10, 100, 1000,10000]
max_features = ['sqrt', 'log2']
rf_grid = dict(n_estimators=n_estimators,max_features=max_features)
#Logistic Regression
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
lr_grid = dict(solver=solvers,penalty=penalty,C=c_values)
#LGB
class_weight = [None,'balanced']
boosting_type = ['gbdt', 'goss', 'dart']
num_leaves = [30,50,100,150] #list(range(30, 150)),
learning_rate = list(np.logspace(np.log(0.005), np.log(0.2), base = np.exp(1), num = 10)) #1000
lgg_grid = dict(class_weight=class_weight, boosting_type=boosting_type, num_leaves=num_leaves, learning_rate =learning_rate)

Then, apply GridSearchCV for each one:

models = [sgd, ridge, knn, svc, bag, rf, lr, lgg]
grids = [sgd_grid, ridge_grid, knn_grid, svc_grid, bag_grid, rf_grid, lr_grid, lgg_grid]
col = 0
for ind in range(0, len(models)):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(estimator=models[col], param_grid=grids[col],
                               n_jobs=-1, cv=cv, scoring='accuracy', error_score=0)
    grid_clf_acc = grid_search.fit(X_train, y_train)
    resul.iloc[1, col] = grid_clf_acc.score(X_valid, y_valid)
    col += 1
resul.head()

The different algorithms have many more hyperparameters that have been deliberately left out here, and you could also search over many more values for each one.
This reduction was made according to the time and computing capacity we were willing to invest.
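
One way to estimate that cost before launching a search (a small sketch using scikit-learn’s ParameterGrid) is to count the candidate combinations in each grid; with RepeatedStratifiedKFold (10 splits, 3 repeats), every combination is fitted 30 times:

from sklearn.model_selection import ParameterGrid

# Number of hyperparameter combinations per model; multiply by 30 CV fits for the total cost
for name, grid in zip(cols[1:], grids):
    print(name, len(ParameterGrid(grid)))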

Row 3 — Applying RandomizedSearchCV

“Enter randomized search. Consider trying every possible combination takes a lot of brute force computation. Data Scientists are an impatient bunch, so they adopted a faster technique: randomly sample from a range of parameters. The idea is that you will cover on the near-optimal set of parameters faster than grid search. This technique, however, is naive. It doesn’t know or remember anything from its previous runs.” ⁴
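
scikit-learn exposes this random sampling directly through ParameterSampler; the sketch below (for illustration only, using the SGD grid defined earlier) shows three randomly drawn combinations:

from sklearn.model_selection import ParameterSampler

# Draw three random hyperparameter combinations from the SGD grid
for params in ParameterSampler(sgd_grid, n_iter=3, random_state=1):
    print(params)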

For practical purposes, the code is identical to the GridSearchCV version, replacing:

random_search = RandomizedSearchCV(models[col],
                                   param_distributions=grids[col],
                                   n_iter=n_iter_search, cv=cv)

instead of

grid_search = GridSearchCV(estimator=models[col], param_grid=grids[col],
                           n_jobs=-1, cv=cv, scoring='accuracy', error_score=0)

Here the complete code:

from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV

col = 0
for ind in range(0, len(models)):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_iter_search = 3
    random_search = RandomizedSearchCV(models[col],
                                       param_distributions=grids[col],
                                       n_iter=n_iter_search, cv=cv)
    random_search.fit(X_train, y_train)
    resul.iloc[2, col] = random_search.score(X_valid, y_valid)
    col += 1
resul.head()

Analyzing the results

So far, checking the results grid, we find that the automated search techniques have given good results.
In some cases, such as the SGD and SVM algorithms, the improvement was substantial, from a floor of 67–69% to 76–78%.
The general trend was an improvement of 1 to 3 percentage points, with GridSearchCV obtaining better results than RandomizedSearchCV, while the random search ran considerably faster than the grid search.
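
The timing claim can be verified with a simple stopwatch around either search; a minimal sketch using the datetime import already loaded at the top of the notebook:

from datetime import datetime

# Wrap any of the .fit() calls above to compare the wall-clock time of grid vs. random search
start = datetime.now()
grid_search.fit(X_train, y_train)
print('Search time:', datetime.now() - start)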

Let’s analyze the winning algorithm (Light Gradient Boosting) in its GridSearchCV version:

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=lgg, param_grid=lgg_grid,
                           n_jobs=-1, cv=cv, scoring='accuracy', error_score=0)
grid_win = grid_search.fit(X_train, y_train)
#Predict values based on new parameters
yv_pred = grid_win.predict(X_valid)
print(confusion_matrix(y_valid, yv_pred))
skplt.metrics.plot_confusion_matrix(y_valid, yv_pred, figsize=(8,8))

To see the hyperparameters found for this version of the algorithm, we use:

print("Best: %f using %s" % (grid_win.best_score_, grid_win.best_params_))

Row 4 — Applying Bayesian optimization with the hyperopt library

I must confess that before I understood how to apply the Bayesian optimization technique, I felt quite lost and confused, to the point of nearly giving up.
The documentation and examples, both from the library and from other articles I could use as references, are quite vague, too dull, or very outdated.
That was until I found this excellent tutorial⁵ and its code⁶ by Will Koehrsen, which I advise you to review and try step by step, since it is clear and thorough. Following his approach, the first thing we do is define our objective function, which must return a dictionary containing at least the keys ‘loss’ and ‘status’.

import csv
from hyperopt import STATUS_OK
from timeit import default_timer as timer

MAX_EVALS = 500
N_FOLDS = 10

def objective(params, n_folds=N_FOLDS):
    """Objective function for Gradient Boosting Machine Hyperparameter Optimization"""
    # Keep track of evals
    global ITERATION
    ITERATION += 1
    # Retrieve the subsample if present, otherwise set to 1.0
    subsample = params['boosting_type'].get('subsample', 1.0)
    # Extract the boosting type
    params['boosting_type'] = params['boosting_type']['boosting_type']
    params['subsample'] = subsample
    # Make sure parameters that need to be integers are integers
    for parameter_name in ['num_leaves', 'subsample_for_bin', 'min_child_samples']:
        params[parameter_name] = int(params[parameter_name])
    start = timer()
    # Perform n_folds cross validation
    cv_results = lgb.cv(params, train_set, num_boost_round=10000, nfold=n_folds,
                        early_stopping_rounds=100, metrics='auc', seed=50)
    run_time = timer() - start
    # Extract the best score
    best_score = np.max(cv_results['auc-mean'])
    # Loss must be minimized
    loss = 1 - best_score
    # Boosting rounds that returned the highest cv score
    n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)
    # Write to the csv file ('a' means append)
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, params, ITERATION, n_estimators, run_time])
    of_connection.close()
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'iteration': ITERATION,
            'estimators': n_estimators, 'train_time': run_time,
            'status': STATUS_OK}

Domain space: the domain space represents the range of values we want to evaluate for each hyperparameter. At each iteration of the search, the Bayesian optimization algorithm chooses one value for each hyperparameter from this space. In random or grid search, the domain space is a grid; in Bayesian optimization the idea is the same, except that the space holds a probability distribution for each hyperparameter rather than a set of discrete values.

from hyperopt import hp

space = {
    'class_weight': hp.choice('class_weight', [None, 'balanced']),
    'boosting_type': hp.choice('boosting_type', [
        {'boosting_type': 'gbdt', 'subsample': hp.uniform('gdbt_subsample', 0.5, 1)},
        {'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1)},
        {'boosting_type': 'goss', 'subsample': 1.0}]),
    'num_leaves': hp.quniform('num_leaves', 30, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)
}

Here we use several distribution types, sampled in the sketch after this list (see the hyperopt documentation for the complete set):

  • choice : categorical variables
  • quniform : discrete uniform (integers spaced evenly)
  • uniform: continuous uniform (floats spaced evenly)
  • loguniform: continuous log uniform (floats spaced evenly on a log scale)
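
The sketch below draws one random set of hyperparameters from the space defined above, using hyperopt’s sampling utility (as done in Koehrsen’s tutorial):

from hyperopt.pyll.stochastic import sample

# Draw a single random set of hyperparameters from the domain space defined above
print(sample(space))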
Next, we select the optimization algorithm (TPE), create a Trials object to keep track of the results, and open a CSV file to log every iteration:

from hyperopt import tpe
from hyperopt import Trials

# Optimization algorithm
tpe_algorithm = tpe.suggest

# Keep track of results
bayes_trials = Trials()

# File to save first results
out_file = './gbm_trials.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

# Write the headers to the file
writer.writerow(['loss', 'params', 'iteration', 'estimators', 'train_time'])
of_connection.close()

Finally, with all the code ready, we look for the best combination of parameters through the fmin function:

from hyperopt import fmin

# Global variable
global ITERATION
ITERATION = 0
MAX_EVALS = 100

# Create a lgb dataset
train_set = lgb.Dataset(X_train, label=y_train)

# Run optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=MAX_EVALS, trials=bayes_trials,
            rstate=np.random.RandomState(50))

This call triggers the process of finding the best combination. In the original example, MAX_EVALS was set to 500; due to performance constraints, it was reduced to 100 for this exercise, which may have affected the final result.

Once the process has finished, we can take the Trials object (bayes_trials in our case) and analyze its results:

# Sort the trials with lowest loss (highest AUC) first
bayes_trials_results = sorted(bayes_trials.results, key = lambda x: x['loss'])
bayes_trials_results[:2]

We can also load a dataframe from the CSV file and use the ast library to turn the stored text back into a dictionary, and then feed our final model with the best result.

results = pd.read_csv('./gbm_trials.csv')

# Sort with best scores on top and reset index for slicing
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.head()

import ast

# Convert from a string to a dictionary
ast.literal_eval(results.loc[0, 'params'])

# Extract the ideal number of estimators and hyperparameters
best_bayes_estimators = int(results.loc[0, 'estimators'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

# Re-create the best model and train on the training data
best_bayes_model = lgb.LGBMClassifier(n_estimators=best_bayes_estimators, n_jobs=-1,
                                      objective='binary', random_state=7,
                                      **best_bayes_params)
best_bayes_model.fit(X_train, y_train)

# Evaluate on the testing data
preds = best_bayes_model.predict(X_valid)
print(confusion_matrix(y_valid, preds))
print(best_bayes_model.score(X_valid, y_valid))
skplt.metrics.plot_confusion_matrix(y_valid, preds, figsize=(8,8))
resul.iloc[3,7] = best_bayes_model.score(X_valid, y_valid)
resul.head()
Hyperopt applied to the LGB model

Summary

We have developed the complete code to apply three automated optimization techniques to eight different classification algorithms.
The time and computational capacity these techniques require is something to take into account; it seems sensible to invest in this optimization once the selected model is already mature enough.
Hyperparameter optimization does not necessarily imply a continued improvement of the trained model on the test data, since the selected parameters may be overfitting.
Finally, it should be noted that the improvements observed on the different models have generally been of considerable magnitude, which at least leaves the door open to increasing an algorithm’s performance through any of these techniques.

Sources and references

[1] https://en.wikipedia.org/w/index.php?title=Accuracy_and_precision&action=edit&section=4

In binary classification Accuracy is also used as a statistical measure of how well a binary classification test correctly identifies or excludes a condition. That is, the accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined… The formula for quantifying binary accuracy is:

Accuracy = (TP+TN)/(TP+TN+FP+FN)

where: TP = True positive; FP = False positive; TN = True negative; FN = False negative

[2] https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/

[3] https://towardsdatascience.com/automated-machine-learning-hyperparameter-tuning-in-python-dfda59b72f8a

[4] https://medium.com/apprentice-journal/hyper-parameter-optimization-c9b78372447b

[5] https://towardsdatascience.com/automated-machine-learning-hyperparameter-tuning-in-python-dfda59b72f8a

[6] https://github.com/WillKoehrsen/hyperparameter-optimization/blob/master/Bayesian%20Hyperparameter%20Optimization%20of%20Gradient%20Boosting%20Machine.ipynb
