Hyperparameter Optimization with Keras

Finding the right hyperparameters for your deep learning model can be a tedious process. It doesn’t have to be.

Mikko
Towards Data Science

--

TL;DR

With the right process in place, it is not difficult to find a state-of-the-art hyperparameter configuration for a given prediction task. Out of the three approaches — manual, machine-assisted, and algorithmic — this article will focus on machine-assisted. The article will cover how I do it, show evidence that the method works, and explain why it works. The main principle is simplicity.

Few Words on Performance

The first point about performance relates to the issue of accuracy (and other more robust metrics) as a way to measure model performance. Consider f1 score as an example. If you have a binary prediction task with 1% positives, then a model that predicts 0 for everything will get near-perfect accuracy, and depending on how the f1 score is averaged and how corner cases are handled, a deceptively high f1 score as well. This can be handled with changes to the way f1 score deals with corner cases such as “all zeros,” “all ones,” and “no true positives.” But that’s a big topic, outside the scope of this article, so for now I just want to make it clear that this problem is a very important part of getting systematic hyperparameter optimization to work. There is a lot of research in this field, but it focuses more on algorithms and less on the fundamentals. Indeed, you can have the fanciest algorithm in the world — often also a really complex one — making decisions based on a metric that does not make sense. That’s not going to be hugely useful for dealing with “real-life” problems.

Make no mistake; EVEN WHEN WE DO GET THE PERFORMANCE METRIC RIGHT (yes, I’m yelling), we need to consider what happens in the process of optimizing a model. We have a training set, and we have a validation set. As soon as we start to look at the validation results and make changes based on them, we start to create a bias towards the validation set. Now the training results are a product of the bias the machine has, and the validation results are a product of the bias we have. In other words, the resulting model does not have the properties of a well-generalized model; instead, it’s biased away from being generalized. It’s very important to keep this point in mind.

The key point about a more advanced, fully-automated (unsupervised) approach to hyperparameter optimization is that it involves first solving these two problems. Once they are solved — and yes, there are ways to do that — the resulting metrics need to be combined into a single score. That score then becomes the metric the hyperparameter optimization process is optimized against. Otherwise, no algorithm in the world will help, as it will optimize towards something other than what we are after. What are we after again? A model that will do the task the prediction task articulates. Not just one model for one case (which is often the case in the papers covering the topic), but all kinds of models, for all kinds of prediction tasks. That is what a solution such as Keras allows us to do, and any attempt to automate parts of the process of using a tool such as Keras should embrace that idea.

What Tools Did I Use?

For everything in this article, I used Keras for the models, and Talos, a hyperparameter optimization solution I built. The benefit is that it exposes Keras as-is, without introducing any new syntax. It lets me do in minutes what used to take days, while having fun instead of enduring painful repetition.

You can try it for yourself:

pip install talos

Or look at the code / docs here.

But the information I want to share, and the point I want to make, is not about a tool, but about the process. You could follow the same procedure with whatever tools you like.

One of the more prominent issues with automated hyperparameter optimization and related tools is that you generally end up far away from the way you’re used to working. The key to successful prediction-task-agnostic hyperparameter optimization — as is with all complex problems — is in embracing cooperation between human and machine. Every experiment is an opportunity to learn more about the practice (of deep learning) and the technology (in this case Keras). That opportunity should not be missed at the expense of process automation. At the same time, we should be able to take away the blatantly redundant parts of the process. Think of pressing shift-enter in Jupyter a few hundred times and waiting a minute or two between each iteration. In summary, at this point the goal should not be a fully-automated approach to finding the right model, but minimizing the procedural redundancy that burdens the human. Instead of me mechanically operating the machine, the machine operates itself. Instead of analyzing the results of various model configurations one by one, I want to analyze them by the thousands or by the hundreds of thousands. There are over 80,000 seconds in a day, and a lot of parameter space can be covered in that time without me having to do anything.

Let’s Get Scannin’

For the sake of example, I will first provide the code that I used throughout the experiment covered in this article. The dataset I used is the Wisconsin Breast Cancer dataset.

Once the Keras model is defined, it’s time to decide the initial parameter boundaries. The dictionary is then fed into the process, where a single permutation is picked at a time and then discarded.
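
To make this concrete, below is a sketch of what such a dictionary can look like. The boundaries shown are illustrative rather than the exact ones used in the experiment; the keys simply need to match what the model function expects.

# illustrative parameter boundaries; one permutation is drawn at a time
from keras.optimizers import Adam, Nadam, RMSprop
from keras.activations import relu, elu
from keras.losses import binary_crossentropy, logcosh

p = {'lr': [0.1, 0.5, 1, 2, 5, 10],          # relative to each optimizer's default
     'first_neuron': [4, 8, 16, 32, 64],
     'hidden_layers': [0, 1, 2, 3],
     'batch_size': [1, 2, 3, 4],
     'epochs': [50, 100, 150],
     'dropout': [0, 0.1, 0.2, 0.3, 0.4, 0.5],
     'optimizer': [Adam, Nadam, RMSprop],
     'losses': [binary_crossentropy, logcosh],
     'activation': [relu, elu],
     'last_activation': ['sigmoid']}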

Depending on the losses, optimizers, and activations we want to include in the scan, we’ll need to import those functions/classes from Keras first. Next, with the model and parameters ready, it’s time to start the experiment.
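
Here is a sketch of the model function and the Scan call that starts the experiment. The function signature and return values follow the pattern Talos documents, but the layer structure is an illustrative reconstruction rather than the exact model used here, and Scan’s argument names vary between Talos versions.

import talos
from keras.models import Sequential
from keras.layers import Dense, Dropout
# Talos helper that scales lr so that 1 corresponds to the optimizer's Keras
# default; the import path may differ between Talos versions
from talos.model.normalizers import lr_normalizer

def breast_cancer_model(x_train, y_train, x_val, y_val, params):
    # Talos calls this once per permutation, passing one set of values in `params`
    model = Sequential()
    model.add(Dense(params['first_neuron'],
                    input_dim=x_train.shape[1],
                    activation=params['activation']))
    model.add(Dropout(params['dropout']))

    # hidden layers as defined by this permutation
    for _ in range(params['hidden_layers']):
        model.add(Dense(params['first_neuron'], activation=params['activation']))
        model.add(Dropout(params['dropout']))

    model.add(Dense(1, activation=params['last_activation']))

    model.compile(optimizer=params['optimizer'](
                      lr=lr_normalizer(params['lr'], params['optimizer'])),
                  loss=params['losses'],
                  metrics=['acc'])

    out = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    batch_size=params['batch_size'],
                    epochs=params['epochs'],
                    verbose=0)

    # Talos expects the history object and the model back
    return out, model

scan = talos.Scan(x, y, params=p, model=breast_cancer_model,
                  experiment_name='breast_cancer')  # argument name varies by version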

Note that I’m not going to share more code, as all I did was change the parameters in the parameter dictionary based on the insights provided in the following sections. For completeness, I will share a link to a notebook with the code at the end of the post.

Because there are many permutations (over 180,000 in total) in this first round of the experiment, I randomly pick just 1% of the total, and we’re left with 1,800 permutations.
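
In Talos this random sampling can be done directly in the Scan call; the argument name has changed between versions (grid_downsample in older releases, fraction_limit in 1.x), so treat this as a sketch for whichever version you have.

# randomly sample roughly 1% of the full permutation grid
scan = talos.Scan(x, y, params=p, model=breast_cancer_model,
                  experiment_name='breast_cancer',
                  fraction_limit=0.01)  # 'grid_downsample=0.01' in older Talos versions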

In this case, I’m running on a 2015 MacBook Air, and it looks like I have just enough time to meet a friend and have a cup of coffee (or two).

Hyperparameter Scanning Visualized

For this article, using the Wisconsin Breast Cancer dataset, I’ve set up the experiment assuming no previous knowledge about the optimal parameters or the dataset. I’ve prepared the dataset by dropping one column and by transforming the rest so that each feature has a mean of 0 and a standard deviation of 1.
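
A minimal sketch of that preparation, assuming the data is loaded into a pandas DataFrame with an 'id' column to drop and a 'diagnosis' label column; the file and column names are assumptions for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('wisconsin_breast_cancer.csv')   # assumed filename
y = (df['diagnosis'] == 'M').astype(int).values   # malignant = 1, benign = 0 (assumed label column)

x = df.drop(['id', 'diagnosis'], axis=1)          # drop the identifier column and the label
x = StandardScaler().fit_transform(x)             # mean 0, standard deviation 1 per feature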

After the initial run of 1,800 permutations, it’s time to look at the results and decide on how we’re going to limit (or otherwise alter) the parameter space.

A simple rank order correlation shows that lr (learning rate) has the strongest effect on our performance metric, which in this case is val_acc (validation accuracy). For this dataset val_acc is fine, as there is a good number of positives; for datasets with a significant class imbalance, accuracy is not a good metric. It seems that hidden_layers, lr, and dropout all have a notable negative correlation with val_acc; a simpler network will do better in this task. Among the positive correlations, the number of epochs is the only one that stands out. Let’s look a little closer. In the graph below we have epochs (50, 100, and 150) on the x-axis, val_acc on the y-axis, learning rate in the columns, and dropout as hue. The trend is generally as the correlation suggests; smaller dropouts do better than larger ones.
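
The rank order correlation itself is a one-liner once the scan results are in a DataFrame; Talos keeps them in the Scan object (scan.data in recent versions), and Spearman is a natural choice for rank order.

# per-permutation results of the scan as a DataFrame
results = scan.data

# Spearman rank order correlation of each numeric hyperparameter against val_acc
corr = (results.select_dtypes('number')
               .corr(method='spearman')['val_acc']
               .drop('val_acc')
               .sort_values())
print(corr)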

Another way to look at dropout is through kernel density estimation. Here we can see a slight tendency towards higher val_acc with dropout at 0 or 0.1, as well as less tendency towards low val_acc (around the 0.6 mark).
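
A kernel density estimate like this takes only a few lines with seaborn; a sketch, assuming the results DataFrame from above.

import matplotlib.pyplot as plt
import seaborn as sns

# one val_acc density per dropout value
for d in sorted(results['dropout'].unique()):
    sns.kdeplot(results.loc[results['dropout'] == d, 'val_acc'], label=f'dropout {d}')

plt.xlabel('val_acc')
plt.legend()
plt.show()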

The first action item for the next round of scanning is to get rid of the higher dropout rates altogether and focus on values between 0 and 0.2. Let’s take a look at learning rate more closely next. Note that learning rates are normalized across optimizers to a scale where 1 represents the Keras default value of that optimizer.

The situation is pretty clear; the smaller learning rates work well for both loss functions, and the difference is particularly pronounced with logcosh. But because binary cross-entropy clearly outperforms at all learning rate levels, it will be the loss of choice for the remainder of the experiment. A sanity check is still needed, though. What if what we’re seeing doesn’t factor in over-fitting towards the training data? What if val_loss is all over the place and we’re just getting carried away looking at one side of the picture? A simple regression analysis shows that’s not the case. Other than a few outliers, everything is packed nicely into the lower left corner, where we want it; both training and validation loss are close to zero.
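
The sanity check is just a scatter of training loss against validation loss with a fitted regression line; for example:

import matplotlib.pyplot as plt
import seaborn as sns

# each point is one permutation; ideally both losses are packed near the origin
sns.regplot(x='loss', y='val_loss', data=results)
plt.show()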

I think for now we know enough; it’s time to set up the next round of the experiment! As a point of reference, the parameter space for the next experiment looks like this:
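
(The exact values aren’t preserved here; the sketch below is an illustrative reconstruction based on the insights described, using the same keys as the round-one dictionary.)

p = {'lr': [0.1, 0.5, 1],                          # only the smaller learning rates
     'first_neuron': [4, 8, 16, 32, 64],
     'hidden_layers': [0, 1, 2],
     'batch_size': [1, 2, 3, 4],                   # refined to small batch sizes
     'epochs': [50, 100, 150],
     'dropout': [0, 0.1, 0.2],                     # higher dropout rates removed
     'kernel_initializer': ['uniform', 'normal'],  # 'uniform' added this round
     'optimizer': [Adam, Nadam, RMSprop],
     'losses': [binary_crossentropy],              # logcosh dropped
     'activation': [relu, elu],
     'last_activation': ['sigmoid']}
# the model function's Dense layers would also need
# kernel_initializer=params['kernel_initializer'] for the new key to take effect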

In addition to refining the learning rate, dropout, and batch size boundaries, I’ve added kernel_initializer ‘uniform.’ Remember that at this stage the objective is to learn about the prediction task, as opposed to being too focused on finding the solution. The key point here is experimentation and learning about the overall process, in addition to learning about the specific prediction challenge.

Round 2 — Increase the Focus on Results

Initially, the less we focus on the result (and the more on the process), the more likely we are to get a good result. It’s like playing chess; if at first you’re too focused on winning the game, you won’t focus on the opening and mid-game. Competitive chess is won in the endgame, based on playing a strong opening and middle game. If things go well, the second iteration in the hyperparameter optimization process is the middle game. We’re not entirely focused on winning yet, but it helps to have an eye on the prize already. In our case, the results from the first round (94.1% validation accuracy) indicate that with the given dataset and the set parameter boundaries, there are predictions to be made.

In this case, the prediction task is to say whether a tumor is benign or malignant. This type of prediction is kind of a big deal in the sense that both false positives and false negatives matter; getting the prediction wrong will have a real negative effect on a person’s life. In case you are interested, there are a number of papers written on this dataset, along with other relevant info, all of which you can find here.

The result for the second round is 96% validation accuracy. The correlation below shows that the only thing sticking out at this point is the number of epochs, so that’s the one thing I’m going to change for the third round.

If you look at the correlation alone, there is a danger of missing something in the bigger picture. In hyperparameter optimization, the big picture is about the individual values within a given parameter and their interconnectedness with all the other values. Now that we’ve eliminated the logcosh loss function and have just one loss (binary_crossentropy) in the parameter space, I want to learn a little about how the different optimizers perform in the context of the epochs.

It is exactly as the correlation suggests regarding epochs (now on the x-axis). Because RMSprop underperforms at both 100 and 150 epochs, let’s drop it from the next round as well.
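
A grouped view like that takes a single seaborn call; a sketch, again assuming the results DataFrame from the scan.

import matplotlib.pyplot as plt
import seaborn as sns

# val_acc by epochs, split by optimizer
sns.boxplot(x='epochs', y='val_acc', hue='optimizer', data=results)
plt.show()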

Before moving on, let’s consider very briefly a fundamental question related to hyperparameter optimization as an optimization challenge. What is it that we’re trying to achieve? The answer can be summarized using two simple concepts:

  • Prediction optimum
  • Result entropy

Prediction optimum is where we have a model that is both precise and generalized. Result entropy is where the entropy is as close to zero (minimal) as possible; it can be understood as a measure of similarity between all the results within a result set (one round through n permutations). The ideal scenario is where the prediction optimum is 1, meaning 100% prediction performance and 100% generality, and the result entropy is 0, meaning that no matter what we do within the hyperparameter space, we get the perfect result every time. This is not feasible for several reasons, but it is helpful to keep in mind as the objective of optimizing the hyperparameter optimization process itself. The other way to approach the question is through three levels of consideration:

  1. The prediction task, where the goal is to find a model that provides a solution for the task
  2. The hyperparameter optimization task, where the goal is to find the best model (with least effort) for the prediction task
  3. The optimization of the hyperparameter optimization task, where the goal is to find the best approach to finding the best model for the prediction task

You might then ask if this leads us to an infinite progression where we need optimizers on top of optimizers, and the answer is yes. In my view, what makes the hyperparameter optimization problem interesting is the way it leads us towards a solution for the problem of “models that build models.” But that would take us far beyond the scope of this article.

With the second, and particularly the third, aspects in mind, we need to consider the computational efficiency of the process. The less of the computational resource we waste, the more we have left for finding the best possible result with respect to aspects one and two. Consider the graphs below in this light.

The second-round KDE looks much better in the sense of having the resources allocated where we need them to be. The results are closer to 1 on the x-axis, and there is very little “spillage” towards 0. Whatever compute resources are going into the scan, they are doing important work. The ideal picture here would be a single straight line at x = 1.

Round 3 — Generalization and Performance

Let’s get right to it. The peak validation accuracy is now 97.1%, and it looks like we’re going in the right direction. I made the mistake of setting just 175 epochs as the maximum, and based on the plot below, it looks like we have to go further than that, at least with this configuration. Which makes me think…maybe for the last and final round, we should try something surprising.

As discussed at the beginning, it’s important to consider generalization as well. Every time we look at the results, our insights start to affect the experiment. The net result is that we get less generalized models that work well with the validation dataset but might not work well with a “real-life” dataset. In this case, we don’t have a good way to test for this kind of bias, but at least we can take measures to assess the degree of pseudo-generalization with what we have. Let’s look at training and validation accuracy first.

Even though this does not definitively confirm that we have a well-generalized model (in fact, it falls well short of that), the regression analysis result could not be much better. Next, let’s look at loss.

It’s even better. Things are looking good. For the last round, I’m going to increase the number of epochs, but I’m also going to try another approach. So far I’ve only used very small batch sizes, which take a lot of time to process; in the third round, I only included batch sizes 1 through 4. For the next round, I’m going to throw in 30 or so and see what that does.

A few words about early stopping. Keras provides a very convenient way to use callbacks through its EarlyStopping functionality. As you might have noticed, I’m not using it here. Generally speaking, I would recommend using it, but it is not as trivial as everything we’ve done so far. Getting the settings right, in a way that doesn’t limit your ability to find the best possible results, is not straightforward. The most important aspect has to do with metrics; I would want to create a custom metric first and then use that as the metric EarlyStopping monitors (instead of val_acc or val_loss). That said, EarlyStopping, and callbacks in general, are a very powerful addition to the hyperparameter optimization process.
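
For reference, basic use of the callback inside the model function would look like this; the patience and min_delta values are arbitrary examples, and the monitor would ideally be the custom metric mentioned above.

from keras.callbacks import EarlyStopping

# stop a permutation early once val_loss stops improving
early_stop = EarlyStopping(monitor='val_loss',  # ideally a custom metric instead
                           min_delta=0,
                           patience=10,
                           mode='auto')

out = model.fit(x_train, y_train,
                validation_data=(x_val, y_val),
                batch_size=params['batch_size'],
                epochs=params['epochs'],
                callbacks=[early_stop],
                verbose=0)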

Round 4 — Final Results are In

Before diving into the results, let’s look at one more visualization from the last round, this time a 5-dimensional one. I wanted to see the remaining parameters — kernel initializer, batch size, hidden layers, and epochs — all in the same picture, compared against validation accuracy and loss. First, accuracy.

Mostly it’s neck-and-neck, but some things do stand out. The first is that when a hidden-layer value (hue) lags behind, in most cases it’s the one-hidden-layer configuration. For batch sizes (columns) it’s hard to say, as it is for the kernel initializer (rows). Next, let’s take a look at validation loss on the y-axis and see if we can learn more there. Remember, here we’re looking for smaller values; we’re trying to minimize the loss function with each parameter permutation.
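
A multi-panel view like this can be put together with seaborn’s catplot, assuming epochs on the x-axis; the exact plotting style used for the figures here may differ.

import seaborn as sns

# epochs vs val_acc, hidden_layers as hue, batch_size across columns,
# kernel_initializer across rows
sns.catplot(x='epochs', y='val_acc', hue='hidden_layers',
            col='batch_size', row='kernel_initializer',
            data=results, kind='point')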

The uniform kernel initializer does a great job of keeping the loss down throughout all the epoch, batch size, and hidden-layer variations. But because the results are a little inconsistent, I’ll keep both initializers until the end.

And the Winner Is…

The winning combination came from the last-minute idea of trying a bigger batch size to save time (and to get there in fewer epochs, too).

The highest result for the small batch sizes was a validation accuracy of 97.7%. With the larger batch size approach, there is also the upside of having the model converge very fast. At the end of this article, I will provide a video where you can see it for yourself. To be honest, once I saw how well the bigger batch size worked, I set up a separate test focusing just on that. It took less than a minute to set up, as all I needed to change was the batch size (and, for this, a smaller number of epochs), and the scan finished in 60 minutes. Regarding the plots, there is not much to see, as more or less all the results were near 100%. There is one more thing I want to share, though, as it relates to the idea of entropy from a different standpoint than what we’ve already discussed. Entropy can be an effective way to assess overfitting (and therefore a proxy for generalization). In this case, I measure the val_loss and val_acc entropy against training loss and accuracy respectively, using KL divergence.
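
A sketch of that measurement using scipy, where entropy(p, q) computes KL divergence; binning the metric values into histograms first, and the training column names 'acc' and 'loss', are my assumptions rather than something spelled out here.

import numpy as np
from scipy.stats import entropy

def kl_divergence(p_values, q_values, bins=20):
    """KL divergence between two empirical distributions via shared-range histograms."""
    lo = min(p_values.min(), q_values.min())
    hi = max(p_values.max(), q_values.max())
    p_hist, _ = np.histogram(p_values, bins=bins, range=(lo, hi))
    q_hist, _ = np.histogram(q_values, bins=bins, range=(lo, hi))
    # a small constant avoids division by zero in empty bins
    return entropy(p_hist + 1e-9, q_hist + 1e-9)

# validation metrics against their training counterparts
acc_kl = kl_divergence(results['val_acc'].values, results['acc'].values)
loss_kl = kl_divergence(results['val_loss'].values, results['loss'].values)
print('val_acc vs acc KL:', acc_kl, 'val_loss vs loss KL:', loss_kl)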

Summary of the Process

  • Start as simply and broadly as possible
  • Try to learn as much as possible about the experiment and your hypothesis
  • Try not to focus too much on the final result during the first iterations
  • Make sure that your performance metric is right
  • Remember that performance is not enough, as it tends to lead you away from generality
  • Each iteration should reduce parameter space and model complexity
  • Don’t be afraid to try things; it is an experiment, after all
  • Use methods you can understand, e.g. clearly visualized descriptive statistics

Here is the complete code notebook for the last round. And the video I had promised…

Thanks for your time! If you have a few seconds more, please share. And have fun in your search for the best parameters!
