Simple Guide to Hyperparameter Tuning in Neural Networks

Matthew Stewart, PhD
Towards Data Science
15 min read · Jul 9, 2019


A step-by-step Jupyter notebook walkthrough on hyperparameter optimization.


This is the fourth article in my series on fully connected (vanilla) neural networks. In this article, we will be optimizing a neural network and performing hyperparameter tuning in order to obtain a high-performing model on the Beale function — one of many test functions commonly used for studying the effectiveness of various optimization techniques. This analysis can be reused for any function, but I recommend trying it out yourself on another common test function to test your skills. Personally, I find that optimizing a neural network can be incredibly frustrating (although not as bad as a GAN, if you’re familiar with those…) unless you have a clear and well-defined procedure to follow. I hope you enjoy this article and find it insightful.

You can access the previous articles below. The first provides a simple introduction to the topic of neural networks for those who are unfamiliar with it. The second article covers more intermediate topics such as activation functions, neural architecture, and loss functions.

All related code can now be found in my GitHub repository.

Beale’s Function

Neural networks are fairly commonplace now in industry and research, but an embarrassingly large proportion of practitioners are unable to work with them well enough to produce high-performing networks capable of outperforming most other algorithms.

When applied mathematicians develop a new optimization algorithm, one thing they like to do is test it on a test function, which is sometimes called an artificial landscape. These artificial landscapes help us find a way of comparing the performance of various algorithms in terms of their:

  • Convergence (how fast they reach the answer)
  • Precision (how closely they approximate the exact answer)
  • Robustness (whether they perform well for all functions or just a small subset)
  • General performance (e.g. computational complexity)

From just scrolling down the Wikipedia article on optimization test functions, you can see that some of the functions are pretty nasty. Many of them have been chosen as they highlight specific issues that can plague optimization algorithms. For this article, we will be looking at a relatively innocuous-looking function called the Beale function.

The Beale function looks like this:

The Beale function.

This function does not look particularly terrifying, right? The reason it is used as a test function is that it assesses how well optimization algorithms perform in flat regions with very shallow gradients. In these cases, it is particularly difficult for gradient-based optimization procedures to reach any minimum, as they are unable to learn effectively.

The remainder of this article will follow the Jupyter notebook tutorial on my GitHub repository. We will discuss the way in which one would tackle this kind of artificial landscape. This landscape is analogous to the loss surface of a neural network. When training a neural network, the goal is to find the global minimum on the loss surface by performing some form of optimization — typically stochastic gradient descent.

By learning how to approach a difficult optimization function, the reader should be more prepared to deal with real-life scenarios for implementing neural networks.

For those of you reading who are not familiar with the Jupyter notebook, feel free to read more about it here.

Before we touch any neural networks, we first have to define the function and find its minimum (otherwise, how will we know we got the right answer?). The first step (after importing any relevant packages) is to define the Beale function in our notebook:
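The exact code in the notebook may differ, but a minimal definition of the Beale function looks like this:

```python
import numpy as np

def beale(x, y):
    # Beale function; global minimum is f(3, 0.5) = 0
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)
```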

We then set some function boundaries since we have ballpark estimates for where the minimum is in this case (from our plot), as well as a step size for our grid mesh.

We then make a mesh grid of points based on this information and are ready to find the minimum.

Now we make a (terrible) initial guess.

We then use SciPy’s optimizer (scipy.optimize.minimize) and see what answer pops out.
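As a rough sketch of these four steps (the exact bounds, step size, and starting point in the notebook may differ):

```python
import numpy as np
from scipy.optimize import minimize

# Search window and grid step, based on ballpark estimates from the plot
xmin, xmax, xstep = -4.5, 4.5, 0.2
ymin, ymax, ystep = -4.5, 4.5, 0.2

# Mesh grid of candidate points (also handy for plotting the landscape)
x_grid, y_grid = np.meshgrid(np.arange(xmin, xmax + xstep, xstep),
                             np.arange(ymin, ymax + ystep, ystep))

# A deliberately poor initial guess
x0 = np.array([1.0, 1.0])

# Minimize the Beale function defined earlier, staying inside the bounds
result = minimize(lambda p: beale(p[0], p[1]), x0,
                  method='L-BFGS-B',
                  bounds=[(xmin, xmax), (ymin, ymax)])
print(result.x)  # should land close to (3, 0.5)
```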

This is the result:

It looks like the answer is (3, 0.5), and if you plug these values into the equation, you find that this is the minimum (it also says this on the Wikipedia page).

In the next section, we will start on our neural network.

Optimization in Neural Networks

A neural network can be defined as a framework that combines inputs and tries to guess the output. If we are lucky enough to have some results, called “the ground truth”, to compare against the outputs produced by the network, we can calculate the error. So the network makes a guess, calculates some error function, adjusts its guess in an attempt to minimize this error, and repeats until the error does not go down anymore. This is optimization.

In neural networks, the most commonly used optimization algorithms are flavors of GD (gradient descent). The objective function used in gradient descent is the loss function we want to minimize.

This tutorial will focus on Keras now, so I will give a brief Keras refresher.

A Keras Refresher

Keras is a Python library for deep learning that can run on top of both Theano and TensorFlow, two powerful Python libraries for fast numerical computing created by the MILA lab at the Université de Montréal and by Google, respectively.

Keras was created to make building deep learning models as fast and easy as possible for research and practical applications. It runs on Python 2.7 or 3.5 and can seamlessly execute on GPUs and CPUs.

Keras is built on the idea of a model. At its core, we have a sequence of layers called the Sequential model, which is a linear stack of layers. Keras also provides the functional API, a way to define complex models, such as multi-output models, directed acyclic graphs, or models with shared layers.

We can summarize the construction of deep learning models in Keras using the Sequential model as follows:

  1. Define your model: create a Sequential model and add layers.
  2. Compile your model: specify loss function and optimizers and call the .compile() function.
  3. Fit your model: train the model on data by calling the .fit() function.
  4. Make predictions: use the model to evaluate performance on held-out data with .evaluate() or to generate predictions on new data with .predict().

You may be asking yourself — how can you examine the model's performance as it is running? This is a good question, and the answer is by using callbacks.

Callbacks: taking a peek into our model while it’s training

You can look at what is happening in various stages of your model by using callbacks. A callback is a set of functions to be applied at given stages of the training procedure. You can use callbacks to get a view of internal states and statistics of the model during training. You can pass a list of callbacks (as the keyword argument callbacks) to the .fit() method of the Sequential or Model classes. The relevant methods of the callbacks will then be called at each stage of the training.

  • A callback function you are already familiar with is keras.callbacks.History(). This is automatically included in .fit().
  • Another very useful one is keras.callbacks.ModelCheckpoint, which saves the model with its weights at a certain point in the training. This can prove useful if your model is running for a long time and a system failure happens; not all is lost, then. It's good practice to save the model weights only when an improvement is observed, as measured by validation accuracy, for example.
  • keras.callbacks.EarlyStopping stops the training when a monitored quantity has stopped improving.
  • keras.callbacks.LearningRateScheduler will change the learning rate during training.

We will apply some callbacks later. For full documentation on callbacks see https://keras.io/callbacks/.
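As an illustration of how these are wired up (the filename, monitored quantities, and patience value here are just placeholders), callbacks are collected in a list and handed to .fit():

```python
from keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks_list = [
    # Save the weights only when validation accuracy improves
    ModelCheckpoint('best_weights.hdf5', monitor='val_acc',
                    save_best_only=True, verbose=1),
    # Stop training once validation loss has stopped improving
    EarlyStopping(monitor='val_loss', patience=5),
]

# Later, once a model exists:
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=60, batch_size=128, callbacks=callbacks_list)
```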

The first thing we must do is import a lot of different functions to make our lives easier.

Another step you can take, if you want your network to use random numbers but still produce repeatable results, is to set a random seed. This produces the same sequence of numbers each time, even though they are still pseudorandom (seeds are a great way to compare models and to test reproducibility).
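For example (the seed value itself is arbitrary):

```python
import numpy as np

# Fix the pseudorandom seed so that runs are repeatable
np.random.seed(42)
```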

Step 1 — Deciding on the network topology (not really considered optimization, but very important)

We will use the MNIST dataset, which consists of grayscale images of handwritten digits (0–9) whose dimension is 28x28 pixels. Each pixel is 8 bits, so its value ranges from 0 to 255.

Obtaining the dataset is very easy since there is a function for it built-in to Keras.
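A sketch of the loading step (mnist.load_data downloads the data on first use):

```python
from keras.datasets import mnist

# Load the training and test splits of MNIST
(X_train, y_train), (X_test, y_test) = mnist.load_data()

print(X_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
```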

The output shapes for our X and Y data are (60000, 28, 28) and (60000,), respectively. It is always a good idea to print some of the data to check the values (and the data type if necessary).

We can check the training data by looking at one image of each of the digits to make sure that none of them are missing from our data.

The last check is for the dimensions of the training and test sets, which can be done relatively easily:

We find that we have 60,000 training images and 10,000 test images. The next thing to do is preprocess the data.

Preprocessing the data

To run our NN, we need to preprocess the data (these steps can be performed in either order):

  • First, we must flatten the 2D image arrays into 1D. We can do this either with array reshaping via numpy.reshape() or with Keras' built-in layer for this, keras.layers.Flatten, which transforms the format of the images from a 2D array (28 by 28 pixels) to a 1D array of 28 × 28 = 784 pixels.
  • Then we need to normalize the pixel values (give them values between 0 and 1) using the min–max transformation x := (x − min) / (max − min).

In our case, the minimum is zero and the maximum is 255, so the formula simply becomes x := x/255.

We now want to one-hot encode our data.
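A sketch combining these preprocessing steps (using the variable names from the loading code above):

```python
from keras.utils import np_utils

# Flatten each 28x28 image into a 784-dimensional vector of floats
X_train = X_train.reshape(60000, 784).astype('float32')
X_test = X_test.reshape(10000, 784).astype('float32')

# Normalize pixel values from [0, 255] to [0, 1]
X_train /= 255
X_test /= 255

# One-hot encode the labels, e.g. the digit 3 becomes [0,0,0,1,0,0,0,0,0,0]
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)
```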

Now we are finally ready to build our model!

Step 2 — Adjusting the learning rate

One of the most common optimization algorithms is Stochastic Gradient Descent (SGD). The hyperparameters that can be optimized in SGD are learning rate, momentum, decay and nesterov.

The learning rate controls how large the weight updates are at the end of each batch, and momentum controls how much the previous update influences the current weight update. Decay indicates how the learning rate decays over each update, and nesterov takes the value “True” or “False” depending on whether we want to apply Nesterov momentum.

Typical values for those hyperparameters are lr=0.01, decay=1e-6, momentum=0.9, and nesterov=True.

The learning rate hyperparameter goes into the optimizer function, which we will see below. Keras has a default learning rate decay schedule in the SGD optimizer that decreases the learning rate during training. The learning rate is decreased according to this formula:

lr = lr₀ × 1 / (1 + decay × epoch)

Source: http://cs231n.github.io/neural-networks-3

Let’s implement a learning rate adaptation schedule in Keras. We'll start with SGD and a learning rate value of 0.1. We will then train the model for 60 epochs and set the decay argument to 0.1/60 ≈ 0.0017. We also include a momentum value of 0.8 since that seems to work well when using an adaptive learning rate.
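A sketch of that optimizer configuration (using the values just described):

```python
from keras.optimizers import SGD

epochs = 60
learning_rate = 0.1
decay_rate = learning_rate / epochs   # roughly 0.0017
momentum = 0.8

sgd = SGD(lr=learning_rate, momentum=momentum,
          decay=decay_rate, nesterov=False)
```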

Next, we build the architecture of the neural network:
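The exact architecture in the notebook may differ, but a simple fully connected network for MNIST could look like this (it uses the sgd optimizer defined above):

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(256, input_dim=784, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=sgd, metrics=['accuracy'])
```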

We can now run the model and see how well it performs. This took around 20 minutes on my machine and may be faster or slower, depending on your machine.

After it has finished running, we can plot the accuracy and loss function as a function of epochs for the training and test sets to see how the network performed.

The loss function plot looks as follows:

Loss as a function of epochs.

And this is the accuracy:

We will now look at applying a customized learning rate.

Apply a custom learning rate change using LearningRateScheduler

Write a function that performs the exponential learning rate decay as indicated by the following formula:

lr = lr₀ × e^(−kt), where lr₀ is the initial learning rate, k is a decay constant, and t is the epoch index.

This is similar to before, so I will do this in one code block and describe the differences.
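A sketch of what that block might look like (the decay constant k and the initial rate here are illustrative):

```python
import numpy as np
from keras.callbacks import LearningRateScheduler, History

initial_lr = 0.1
k = 0.1

def exp_decay(epoch):
    # Exponential decay: lr = lr0 * exp(-k * t), where t is the epoch index
    return initial_lr * np.exp(-k * epoch)

loss_history = History()
lr_scheduler = LearningRateScheduler(exp_decay)
callbacks_list = [loss_history, lr_scheduler]

# Re-build and re-compile the model as before, then train with these callbacks:
# model.fit(X_train, y_train, validation_split=0.2, epochs=60,
#           batch_size=128, callbacks=callbacks_list, verbose=2)
```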

We see that the only thing that has changed is the presence of the exp_decay function we defined and its use in the LearningRateScheduler callback. Notice that we also chose to add a few callbacks to our model this time.

We can now plot the learning rate and loss functions as functions of the number of epochs. The learning rate plot is incredibly smooth as it follows our predefined exponentially decaying function.

The loss function also looks smoother now as compared to before.

This shows you that developing a learning rate scheduler can be a helpful way to improve neural network performance.

Step 3 — Choosing an optimizer and a loss function

When constructing a model and using it to make our predictions, for example, to assign label scores to images (“cat,” “plane,” etc.), we want to measure our success or failure by defining a “loss” function (or objective function). The goal of optimization is to efficiently calculate the parameters/weights that minimize this loss function. Keras provides various types of loss functions.

Sometimes the “loss” function measures the “distance.” We can define this “distance” between two data points in various ways suitable to the problem or dataset; the distance used depends on the data type and the problem being tackled. For example, in natural language processing (which analyses textual data), distances between strings such as the Hamming distance are much more common.

Distance

  • Euclidean
  • Manhattan
  • others, such as Hamming, which measures distances between strings, for example. The Hamming distance of “carolin” and “cathrin” is 3.

Loss functions

  • MSE (for regression)
  • categorical cross-entropy (for classification)
  • binary cross entropy (for classification)

Step 4 — Deciding on the batch size and number of epochs

The batch size defines the number of samples propagated through the network.

For instance, let’s say you have 1000 training samples, and you want to set up a batch_size equal to 100. The algorithm takes the first 100 samples (from 1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (from 101st to 200th) and trains the network again. We can keep doing this procedure until we have propagated all samples through the network.

Advantages of using a batch size < number of all samples:

  • It requires less memory. Since you train the network on fewer samples at a time, each training step requires less memory. That’s especially important if you cannot fit the whole dataset in your machine’s memory.
  • Typically, networks train faster with mini-batches. That’s because we update the weights after each mini-batch rather than only once per pass over the data.

Disadvantages of using a batch size < number of all samples:

  • The smaller the batch, the less accurate the estimate of the gradient will be.

The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.

One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. An epoch comprises one or more batches.

There are no hard and fast rules for selecting batch sizes or the number of epochs, and there is no guarantee that increasing the number of epochs provides a better result than a lesser number.
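In Keras, both are simply arguments to .fit(); the values below are only illustrative:

```python
batch_size = 128   # samples per gradient update
epochs = 10        # full passes over the training set

history = model.fit(X_train, y_train,
                    batch_size=batch_size, epochs=epochs,
                    validation_split=0.2, verbose=2)
```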

Step 5 — Random restarts

This method does not seem to have a built-in implementation in Keras, but it can be achieved by adapting keras.callbacks.LearningRateScheduler. I will leave a full implementation as an exercise for the reader, but it essentially involves resetting the learning rate after a specified number of epochs, a finite number of times.
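A minimal sketch of the idea (the restart period and the decay constants are arbitrary):

```python
import numpy as np
from keras.callbacks import LearningRateScheduler

period, lr0, k = 20, 0.1, 0.1

def restart_decay(epoch):
    # Restart the exponential decay schedule every `period` epochs
    return lr0 * np.exp(-k * (epoch % period))

restart_scheduler = LearningRateScheduler(restart_decay)
```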

Tuning Hyperparameters using Cross-Validation

Now instead of trying different values by hand, we will use GridSearchCV from Scikit-Learn to try out several values for our hyperparameters and compare the results.

To do cross-validation with Keras, we will use the wrappers for the Scikit-Learn API. They provide a way to use Sequential Keras models (single-input only) as part of your Scikit-Learn workflow.

There are two wrappers available:

  • keras.wrappers.scikit_learn.KerasClassifier(build_fn=None, **sk_params), which implements the Scikit-Learn classifier interface, and
  • keras.wrappers.scikit_learn.KerasRegressor(build_fn=None, **sk_params), which implements the Scikit-Learn regressor interface.
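For a classification problem like MNIST we use the first wrapper. It expects a build_fn that returns a freshly compiled model; the architecture below is just the illustrative one from earlier:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

def create_model():
    # build_fn must return a new, compiled Keras model each time it is called
    model = Sequential()
    model.add(Dense(256, input_dim=784, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='sgd', metrics=['accuracy'])
    return model

model_wrapped = KerasClassifier(build_fn=create_model,
                                epochs=10, batch_size=128, verbose=0)
```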

Trying Different Weight Initializations

The first hyperparameter we will try to optimize via cross-validation is different weight initialization.
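A sketch of that search, reusing the imports from the wrapper example above (the list of initializers and the number of folds are just reasonable choices, and this takes a while to run):

```python
from sklearn.model_selection import GridSearchCV

def create_model(init_mode='uniform'):
    # Same architecture as before, but the weight initializer is now tunable
    model = Sequential()
    model.add(Dense(256, input_dim=784, kernel_initializer=init_mode,
                    activation='relu'))
    model.add(Dense(10, kernel_initializer=init_mode, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='sgd', metrics=['accuracy'])
    return model

model_wrapped = KerasClassifier(build_fn=create_model,
                                epochs=10, batch_size=128, verbose=0)

init_modes = ['uniform', 'lecun_uniform', 'normal', 'zero',
              'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
grid = GridSearchCV(estimator=model_wrapped,
                    param_grid=dict(init_mode=init_modes), cv=3)
grid_result = grid.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result.best_score_,
                             grid_result.best_params_))
```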

The results from our GridSearch are:

We see that the best results are obtained either from the model using lecun_uniform initialization or glorot_uniform initialization and that we can achieve close to 97% accuracy with our network.

Save Your Neural Network Model to JSON

The Hierarchical Data Format (HDF5) is a data storage format for storing large arrays of data, including values for the weights in a neural network.

You can install the HDF5 Python module with pip install h5py.

Keras allows you to describe and save any model using the JSON format.
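A typical save/load pattern looks like this (the file names are placeholders):

```python
from keras.models import model_from_json

# Serialize the architecture to JSON and the weights to HDF5
with open('model.json', 'w') as json_file:
    json_file.write(model.to_json())
model.save_weights('model.h5')

# Later: rebuild the model from JSON and reload the weights
with open('model.json', 'r') as json_file:
    loaded_model = model_from_json(json_file.read())
loaded_model.load_weights('model.h5')
```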

Cross-validation with more than one hyperparameter

Usually, we are not interested in how just one parameter changes, but in how multiple parameter changes can affect our results. We can do cross-validation with more than one parameter simultaneously, effectively trying out combinations of them.

Note: Cross-validation in neural networks is computationally expensive. Think before you experiment! Multiply the numbers of values you are validating for each hyperparameter to see how many combinations there are. Each combination is evaluated using k-fold cross-validation (k is a parameter we choose).

For example, we can choose to search for different values of:

  • batch size
  • number of epochs
  • initialization mode

The choices are specified in a dictionary and passed to GridSearchCV.

We will now perform a GridSearch for batch size, number of epochs and initializer combined.
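A sketch of that combined search (again, the candidate values are just reasonable choices, and model_wrapped is the wrapped model defined above):

```python
param_grid = dict(batch_size=[64, 128, 256],
                  epochs=[10, 20],
                  init_mode=['lecun_uniform', 'glorot_uniform', 'he_uniform'])

grid = GridSearchCV(estimator=model_wrapped, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result.best_score_,
                             grid_result.best_params_))
```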

One last question before we end: what do we do if the number of parameters and values we have to cycle through in our GridSearchCV is particularly large?

This can be a particularly troublesome problem — imagine a situation where there are five parameters being selected for and 10 potential values for each parameter. The number of unique combinations is 10⁵, which means we would have to train a ridiculously large number of networks. It would be insanity to do it this way, so it is common to use RandomizedSearchCV as an alternative.

RandomizedSearchCV allows us to specify all of our candidate parameter values, but rather than trying every combination, it samples a fixed number of random combinations and evaluates each one with cross-validation. In the end, the user can select the best-performing set of parameters and use it as an approximate solution.
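A sketch using scikit-learn's RandomizedSearchCV (n_iter controls how many random combinations are sampled; the candidate values are again only illustrative):

```python
from sklearn.model_selection import RandomizedSearchCV

param_distributions = dict(batch_size=[32, 64, 128, 256],
                           epochs=[10, 20, 40],
                           init_mode=['lecun_uniform', 'glorot_uniform',
                                      'he_uniform', 'normal'])

random_search = RandomizedSearchCV(estimator=model_wrapped,
                                   param_distributions=param_distributions,
                                   n_iter=10, cv=3, random_state=42)
random_result = random_search.fit(X_train, y_train)

print("Best: %f using %s" % (random_result.best_score_,
                             random_result.best_params_))
```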

Final Comments

Thank you for reading, and I hope you found this article helpful and insightful. I look forward to hearing from readers about their applications of this hyperparameter tuning guide. The next article in this series will cover some of the more advanced aspects of fully connected neural networks.

