
Tune Your Scikit-learn Model Using Evolutionary Algorithms

Scikit-learn hyperparameter tuning with evolutionary algorithms and cross-validation.

Photo by Susan Q Yin on Unsplash

Hyperparameter tuning is an essential part of the machine learning pipeline; most common implementations use grid search (random or not) to choose among a predefined set of combinations.

This article uses evolutionary algorithms with the Python package sklearn-genetic-opt to find the hyperparameters that optimize our chosen cross-validation metric. The package has several features that make this process easier:

  • Hyperparameter search using several evolutionary algorithms.
  • Callbacks to stop the optimization when you meet a criterion, log objects to the local file system, or customize your own logic.
  • Logging capabilities via a Logbook object or the built-in MLflow integration.
  • Utility plots to understand the optimization process.

Dataset

As a demonstration dataset, we’ll use the digits data from Scikit-learn; the task is to classify handwritten digits from 8×8 pixel images. A sample of the data is as follows:

Digits dataset. Image by Scikit-learn (BSD License)

The Evolutionary Algorithm

Most evolutionary algorithms start with a population of individuals; each represents a set of hyperparameters to use in our Machine Learning model.

Using mechanisms that try to emulate the way populations evolve, the algorithm reproduces, mutates, and selects new hyperparameters based on the results of the parameters already tested, using some metric to define each individual's fitness (for example, the cross-validation accuracy). It repeats this process over several generations of individuals; here is an extended explanation of the process. Visually, it could look like this:

Evolutionary Cross-Validation. Image by the author.

If you want a more in-depth explanation, you can check my other Medium post explaining the theoretical aspects of this process.
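
To make the loop concrete, here is a schematic sketch of a generational search over hyperparameter dictionaries. It is a simplified illustration, not the routine sklearn-genetic-opt actually runs; the helper names (random_params, fitness) are placeholders you would supply.

import random

def evolutionary_search(random_params, fitness, pop_size=20, generations=10, mut_prob=0.2):
    # Generation 0: random hyperparameter sets ("individuals")
    population = [random_params() for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate every individual, e.g. with cross-validation accuracy
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # selection: keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = {k: random.choice([a[k], b[k]]) for k in a}  # crossover
            if random.random() < mut_prob:  # mutation: resample one value
                key = random.choice(list(child))
                child[key] = random_params()[key]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)  # best hyperparameter set found

In practice, DEAP and sklearn-genetic-opt implement more elaborate selection, crossover, and mutation operators than this sketch.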

Implementation

The way the algorithm works may sound a bit confusing, but packages such as DEAP in Python already provide optimized routines for it.

In this case, we will use sklearn-genetic-opt, a Python package built on top of DEAP and scikit-learn that makes this optimization process more straightforward.

First, let’s install it:

pip install sklearn-genetic-opt

Now, let’s import the data, split it into train and test sets, and create an instance of any scikit-learn classifier; for this example, I will use a Random Forest Classifier.
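
A minimal sketch of that step, assuming the usual scikit-learn loaders (the test-set size and random seed here are illustrative):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the 8x8 digit images as a flat feature matrix
data = load_digits()
X, y = data.data, data.target

# Hold out a test set to evaluate the tuned model later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier()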

Now, we can use sklearn-genetic-opt to tune our classifier based on a metric; in this case, I will choose the accuracy score from stratified cross-validation with three splits.

The param_grid is similar to scikit-learn’s, but we must use the space classes so sklearn-genetic-opt knows which type of data to sample for each parameter.

To define these and other options, we use the package’s main class, GASearchCV.

At the end of this article, I’ll explain the parameters shown so you can change them if you want.

The estimator must be a scikit-learn classifier or regressor, cv is the number of splits in the cross-validation or a cross-validation generator, and scoring is the metric chosen to optimize; it must be one of the scikit-learn metrics compatible with the estimator.
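
Here is a sketch of that configuration; the search space bounds and most values are illustrative assumptions rather than the article’s exact settings:

from sklearn.model_selection import StratifiedKFold
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Categorical, Integer, Continuous

# Illustrative search space for the Random Forest (not the article's exact grid)
param_grid = {
    "min_weight_fraction_leaf": Continuous(0.01, 0.5, distribution="log-uniform"),
    "bootstrap": Categorical([True, False]),
    "max_depth": Integer(2, 30),
    "n_estimators": Integer(100, 300),
}

evolved_estimator = GASearchCV(
    estimator=clf,                                 # the RandomForestClassifier from above
    cv=StratifiedKFold(n_splits=3, shuffle=True),  # stratified CV with three splits
    scoring="accuracy",                            # metric to optimize
    param_grid=param_grid,
    n_jobs=-1,
    verbose=True,
    keep_top_k=4,
)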

Now we can fit our model and use it in our test data; by default, it will use the best set of hyperparameters found:

from sklearn.metrics import accuracy_score

# Fit the search; predict() uses the best hyperparameters found
evolved_estimator.fit(X_train, y_train)
y_predict_ga = evolved_estimator.predict(X_test)
accuracy_score(y_test, y_predict_ga)

You should see something like this:

Training verbose. Image by the Author.

While the algorithm is running, it shows us the metrics it’s achieving at each generation; the "fitness" refers to the metric we chose, in this case, the accuracy. As the algorithm uses more than one set of hyperparameters per generation, it shows the average accuracy, standard deviation, and maximum and minimum individual values.

In this particular run, we got an accuracy in the test set of 0.93 with this set of hyperparameters:

Optimization result. Image by the Author.

We can also see the evolution of the optimization routine, using the command:

from sklearn_genetic.plots import plot_fitness_evolution
import matplotlib.pyplot as plt
plot_fitness_evolution(evolved_estimator)
plt.show()

Fitness Evolution. Image by the Author.

As you can see, the algorithm started with an accuracy of around 0.8 in generation 0, where the hyperparameters were generated randomly, and the accuracy improved as the algorithm chose new sets of hyperparameters using evolutionary strategies. The algorithm probably hasn’t fully converged to its best fitness value, but it already reached about 0.94, so you could leave it running for some extra generations to see if the accuracy improves further.

We can also get a log of all the hyperparameter sets the model tried, along with their cross-validation scores.
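
To inspect those records programmatically, the fitted object exposes them as attributes; the names below (logbook, best_params_) follow the sklearn-genetic-opt docs at the time of writing, so check the current documentation if they differ:

print(evolved_estimator.logbook)       # record of the evaluated candidates (name per the docs)
print(evolved_estimator.best_params_)  # best hyperparameter set found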

Log of hyperparameters used. Image by the Author.

For example, you can use those logs to plot the distribution of the parameters the algorithm sampled and see how its exploration and exploitation strategy behaved.

from sklearn_genetic.plots import plot_search_space
plot_search_space(evolved_estimator)
plt.show()

Sampled hyperparameters distribution. Image by the Author.

And these are the k (keep_top_k=4) best hyperparameter combinations found:

Hall of Fame. Image by the Author.
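
If you prefer to read those top-k combinations programmatically, the fitted object keeps them in a hall-of-fame attribute; the name hof below follows the docs at the time of writing:

print(evolved_estimator.hof)  # the keep_top_k best hyperparameter sets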

So that’s it! As you can see, it is pretty straightforward to run this optimization routine with evolutionary algorithms using sklearn-genetic-opt; it is an open-source project that can help you choose your hyperparameters as an alternative to methods such as scikit-learn’s RandomizedSearchCV or GridSearchCV, which depend on a pre-defined set of hyperparameter combinations to try.

As the author of the package, I welcome any suggestion, contribution, or comment. Here you can see more examples and the source code used:

sklearn-genetic-opt – documentation

The following are the definitions at the time of writing, but make sure to check the latest documentation for updates.

Appendix 1: Space definition

Define the search space with the parameter ‘param_grid’; it takes a dictionary whose values are instances of some of these classes (a short example follows below):

Categorical: Represents categorical variables; expects a list of options it can sample and, optionally, the probability of sampling each option.

Integer: Represents integer variables; expects a lower and upper bound for the variable.

Continuous: Represents real-valued variables; expects a lower and upper bound for the variable and, optionally, can sample values from a log-uniform distribution if the variable is strictly positive.
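
A short sketch of a space dictionary using those optional arguments; the hyperparameter names, bounds, and the priors/distribution keyword names are illustrative and follow the docs at the time of writing:

from sklearn_genetic.space import Categorical, Integer, Continuous

example_space = {
    # Categorical choices with optional sampling probabilities (priors)
    "criterion": Categorical(["gini", "entropy"], priors=[0.7, 0.3]),
    # Integer variable with lower and upper bounds
    "max_depth": Integer(2, 30),
    # Positive real-valued variable sampled on a log-uniform scale
    "min_weight_fraction_leaf": Continuous(1e-3, 0.5, distribution="log-uniform"),
}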

Appendix 2: parameters of GASearchCV

population: The number of hyperparameter candidates (individuals) to generate randomly as the initial population.

generations: How many iterations the algorithm will run; it creates a new population every generation.

elitism: If True, it uses tournament selection and keeps the best k individuals; if False, it uses a roulette selection mechanism.

tournament_size: How many individuals take part in each tournament selection; only used if elitism=True.

crossover_probability: The probability that a crossover occurs in a particular mating.

mutation_probability: The probability that an already created individual undergoes a random change in some of its hyperparameter values.

param_grid: A dictionary with hyperparameter names as keys and, as values, Categorical, Continuous, or Integer objects from sklearn_genetic.space that define the possible values each parameter can take.

criteria: ‘max’ if the chosen ‘scoring’ metric is considered better as its value increases, ‘min’ otherwise.

algorithm: The specific evolutionary algorithm from the DEAP package to use; version 0.5.0 of sklearn-genetic-opt supports eaSimple, eaMuPlusLambda, and eaMuCommaLambda.

n_jobs: How many concurrent jobs to launch during the cross-validation step.

verbose: If True, it displays the optimization metrics while the algorithm runs.

keep_top_k: How many sets of hyperparameters to return at the end of the optimization, ranked by their final cross-validation score. This parameter determines the size of the hall of fame (hof).

log_config: If you set this parameter with an MLflowConfig object, the metrics, hyperparameters, and models get logged into an MLflow server.
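
Putting those options together, here is a hedged sketch of a full GASearchCV call. The values are illustrative, and some argument names may differ slightly in your installed version (for example, the population parameter may be spelled population_size), so check the documentation:

from sklearn_genetic import GASearchCV

evolved_estimator = GASearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
    population_size=10,          # initial number of random candidates
    generations=25,              # number of iterations
    elitism=True,                # tournament selection keeping the best individuals
    tournament_size=3,
    crossover_probability=0.8,
    mutation_probability=0.1,
    criteria="max",              # higher accuracy is better
    algorithm="eaMuPlusLambda",  # one of the supported DEAP algorithms
    keep_top_k=4,                # size of the hall of fame
    n_jobs=-1,
    verbose=True,
)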

