How to Find the Right Architecture for a Neural Network and Fine-Tune Hyperparameters

Asutosh Nayak
Towards Data Science
8 min read · Mar 18, 2019


Image by Gordon Johnson from Pixabay

This blog post is a continuation of my previous post on how to use LSTMs to predict the price of a stock given its historical data. There we saw how to build and compile a Keras LSTM model. Here we will look at some of the ways to find the right architecture and hyperparameters.

Here are some of our options:

  1. Hand Tuning or Manual Search — This is the most painful way of finding the right configuration: you try specific values for each parameter one by one. But with some experience, careful analysis of the initial results and a bit of gut feeling, it can prove really helpful.
  2. Grid Search — This is the only method guaranteed to give you the best set of parameters out of all the options fed to it. You pass in a list of values for each parameter you want to optimize, then train and compute the validation loss for every combination. As you can imagine, this is the most time-consuming method and often not feasible.
  3. Random Search — This is a variant of Grid Search which randomly samples a subset of all possible combinations.
  4. Bayesian Optimization/other probabilistic optimizations — Bayesian Optimization is mathematically involved and, to be honest, I haven't explored the maths behind it. I will just give an overview of what it does; even if you don't know its internal workings you can still apply it in a program, as we will see later (there is also a small sketch right after this list). Bayesian Optimization uses a Gaussian Process to model, or guess, the objective function (the function we want to minimize by finding the right set of hyperparameters). A limit is set on how many times we want to evaluate the objective function, since each evaluation is assumed to be very expensive. First, a random set of points within the parameter ranges is evaluated. Then, using those results, a Gaussian Process is fitted to approximate the objective function, and an "acquisition function" is used to decide which point to sample next. This process repeats until the evaluation budget set above is exhausted. You can refer to this and this for in-depth knowledge. There are other techniques too, like TPE, which is implemented in Hyperopt (code below). You can check this paper to understand TPE.
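To make that loop concrete, here is a minimal sketch using gp_minimize from scikit-optimize, a library I don't use elsewhere in this post. train_and_evaluate is a hypothetical stand-in for "build the data and model for these values, train, and return the validation loss".

from skopt import gp_minimize
from skopt.space import Real, Integer

# the search space: learning rate on a log scale, time_steps as an integer
space = [Real(1e-4, 1e-2, prior="log-uniform", name="lr"),
         Integer(30, 90, name="time_steps")]

def objective(params):
    lr, time_steps = params
    # hypothetical helper: trains the model with these values and
    # returns the validation loss that we want to minimize
    return train_and_evaluate(lr, time_steps)

# n_calls is the evaluation budget; the first few points are sampled at
# random, after that a Gaussian process plus an acquisition function
# picks the next point to try
result = gp_minimize(objective, space, n_calls=25, random_state=42)
print(result.x, result.fun)  # best parameters and best validation loss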

Grid Search Implementation

If you are using scikit-learn models then you can use their GridSearchCV directly. It is fairly straightforward to use; you can follow the link to its documentation above. One advantage is that it can also run jobs in parallel. If you are using a Keras model then you will have to use the scikit-learn wrappers for Keras models, as explained here.
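Here is a minimal sketch of that route, assuming a Keras 2.x style setup; create_model is only an illustrative model, and the training arrays are placeholders for the data from the previous post.

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import Adam
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV

def create_model(lr=0.001):
    # a tiny illustrative model; the real one comes from the previous post
    model = Sequential()
    model.add(LSTM(64, input_shape=(60, 1)))
    model.add(Dense(1))
    model.compile(loss="mean_squared_error", optimizer=Adam(lr=lr))
    return model

estimator = KerasRegressor(build_fn=create_model, verbose=0)
param_grid = {
    "lr": [0.01, 0.001, 0.0001],   # passed to create_model
    "batch_size": [20, 30, 40],    # passed to fit
    "epochs": [30, 50]             # passed to fit
}
grid = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=3)
# grid_result = grid.fit(x_train, y_train)   # arrays from the previous post
# print(grid_result.best_params_, grid_result.best_score_)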

But if that doesn't work, or you don't want the overhead of learning the syntax of a new package, you can implement a bare-minimum grid search like this:
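Something along these lines works; build_data and build_model are hypothetical stand-ins for the data-preparation and model-building functions from the previous post.

import itertools

search_params = {
    "batch_size": [20, 30, 40],
    "time_steps": [30, 60, 90],
    "lr": [0.01, 0.001, 0.0001],
    "epochs": [30, 50, 70]
}

best_loss, best_params = float("inf"), None
keys = list(search_params.keys())

# itertools.product walks through every combination of the values above
for values in itertools.product(*(search_params[k] for k in keys)):
    params = dict(zip(keys, values))
    x_train, y_train, x_val, y_val = build_data(params)   # hypothetical helper
    model = build_model(params)                           # compiles the model with params["lr"]
    history = model.fit(x_train, y_train,
                        epochs=params["epochs"],
                        batch_size=params["batch_size"],
                        validation_data=(x_val, y_val),
                        verbose=0)
    val_loss = min(history.history["val_loss"])
    print(params, "->", val_loss)   # log every run so nothing gets lost
    if val_loss < best_loss:
        best_loss, best_params = val_loss, params

print("Best:", best_params, best_loss)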

Other Smarter Search Implementation

There are several open-source packages available for minimizing objective functions using other, "smarter" search algorithms. I will show examples for Hyperopt and Talos. Below is how you would implement hyperparameter tuning using Hyperopt, which uses the TPE algorithm to minimize the function.

Sample code for Hyperopt
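Here is a sketch of that workflow; build_data and build_model are hypothetical stand-ins for the data-preparation and model-building functions from the previous post, and the parameter values are illustrative.

from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

search_space = {
    "batch_size": hp.choice("batch_size", [20, 30, 40]),
    "time_steps": hp.choice("time_steps", [30, 60, 90]),
    "lr": hp.uniform("lr", 0.0001, 0.01),
    "epochs": hp.choice("epochs", [30, 50, 70]),
    "lstm_layers": hp.choice("lstm_layers", [
        {"layers": "one"},
        {"layers": "two",
         "lstm2_nodes": hp.choice("lstm2_nodes", [20, 40, 70]),
         "lstm2_dropouts": hp.uniform("lstm2_dropouts", 0.2, 0.5)}
    ])
}

def objective(params):
    # hypothetical helpers: build the dataset for these time_steps/batch_size,
    # then build and train the Keras model described by params
    x_train, y_train, x_val, y_val = build_data(params)
    model = build_model(params)
    history = model.fit(x_train, y_train,
                        epochs=params["epochs"],
                        batch_size=params["batch_size"],
                        validation_data=(x_val, y_val),
                        verbose=0)
    val_loss = min(history.history["val_loss"])
    # Hyperopt minimizes whatever we return under the "loss" key
    return {"loss": val_loss, "status": STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=30, trials=trials)
print(best)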

In the above code snippet the variable "search_space" holds the parameters, and the values of those parameters, that you want to search. The "fmin" function at the end is the one that actually does the minimization. The program itself is pretty simple and I have added comments to it, so it shouldn't be a problem to understand. But I would like to dig deeper into the formation of the search space dictionary, because that is slightly awkward. Let's put it under the microscope:
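Here is the same dictionary again, annotated line by line (the values are illustrative):

search_space = {
    # hp.choice takes a label plus a list/tuple of options and returns one of them
    "batch_size": hp.choice("batch_size", [20, 30, 40]),
    "time_steps": hp.choice("time_steps", [30, 60, 90]),
    "epochs": hp.choice("epochs", [30, 50, 70]),
    # hp.uniform samples a value uniformly between low and high
    "lr": hp.uniform("lr", 0.0001, 0.01),
    # a conditional parameter: the network can have one or two LSTM layers
    "lstm_layers": hp.choice("lstm_layers", [
        # option 1: a single LSTM layer, with no extra parameters (kept blank)
        {"layers": "one"},
        # option 2: two LSTM layers, which brings two extra parameters with it
        {"layers": "two",
         "lstm2_nodes": hp.choice("lstm2_nodes", [20, 40, 70]),
         "lstm2_dropouts": hp.uniform("lstm2_dropouts", 0.2, 0.5)}
    ])
}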

The main dictionary holds a key for each parameter we want to optimize. We mostly deal with two functions (or "stochastic expressions", as Hyperopt calls them), namely choice and uniform. "hp.choice" takes a list or tuple of values to try and returns one of those options. "hp.uniform" returns a value sampled uniformly between its 2nd and 3rd parameters (low and high). The string we pass as the first parameter to hp.choice and hp.uniform is a label used by Hyperopt mostly for internal purposes. Where it gets tricky is a parameter like "lstm_layers", which defines the number of LSTM layers in the network. For such a parameter we have two new sets of sub-parameters to test: one when the network has a single LSTM layer, and another when two LSTM layers are used. You can see that I have used hp.choice to tell the system my choices for this parameter (lstm_layers), and the chosen value is given by the key "layers". When Hyperopt tests the model with two LSTM layers it will consider two additional parameters, namely the number of nodes in the 2nd LSTM layer (lstm2_nodes) and the dropout to be used for the 2nd LSTM layer (lstm2_dropouts). I have kept the first LSTM layer's option blank, but you can include other parameters to test there too.

I hope that makes it clear how to build a search space for practical purposes. Now let's look into another library that claims to use probabilistic methods (in combination with grid or random search) to reduce the number of evaluations — Talos.

Code snippet for Talos
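Here is a rough sketch of how the scan can be set up; the exact Scan() arguments differ a bit between Talos versions, and build_data is a hypothetical helper that prepares the data for the current iteration's time_steps and batch_size.

import numpy as np
import talos as ta
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.optimizers import Adam

p = {
    "batch_size": [20, 30, 40],
    "time_steps": [30, 60, 90],
    "lr": [0.01, 0.001, 0.0001],
    "epochs": [30, 50, 70],
    "lstm1_nodes": [70, 100],
    "dropout": [0.2, 0.5]
}

def fit_model(x_train, y_train, x_val, y_val, params):
    # rebuild the data for this iteration's time_steps/batch_size, so the
    # x/y passed to Scan() below are effectively ignored
    x_train, y_train, x_val, y_val = build_data(params)
    model = Sequential()
    model.add(LSTM(params["lstm1_nodes"],
                   input_shape=(params["time_steps"], x_train.shape[2])))
    model.add(Dropout(params["dropout"]))
    model.add(Dense(1))
    model.compile(loss="mean_squared_error", optimizer=Adam(lr=params["lr"]))
    history = model.fit(x_train, y_train,
                        epochs=params["epochs"],
                        batch_size=params["batch_size"],
                        validation_data=(x_val, y_val),
                        verbose=0)
    # Talos expects the History object and the model back
    return history, model

# dummy arrays only to satisfy Scan's signature; real data is built inside fit_model
x_dummy = np.zeros((10, 60, 5))
y_dummy = np.zeros((10, 1))
scan = ta.Scan(x=x_dummy, y=y_dummy, params=p, model=fit_model,
               experiment_name="stock_lstm")   # newer Talos versions require a name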

The usage of Talos is similar to the previous tool: you create a function that builds the model, trains it, and evaluates it on validation data. The only difference is that the model function returns the Keras history object and the model, instead of a dictionary. Another thing to note here is that we pass X and Y to the Scan function but never actually use them, since we rebuild the data based on the "batch_size" and "time_steps" selected in the current iteration. For simpler problems, where the data does not depend on the hyperparameters, you could pass the real arrays to Scan directly and it would be even easier.

You could also use Hyperas (Hyperopt + Keras), which is a wrapper around Hyperopt. The main advantage is that you don't need to learn any new syntax or functions of Hyperopt itself. You build your model as shown below and simply place the values you want to test for a parameter within double curly braces (e.g. {{choice([1, 2, 3])}}).
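A sketch of what that looks like; build_data is again a hypothetical helper, and the layer sizes are only examples.

from hyperopt import Trials, STATUS_OK, tpe
from hyperas import optim
from hyperas.distributions import choice, uniform
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

def data():
    # must return the arrays that the model function below accepts
    x_train, y_train, x_val, y_val = build_data()   # hypothetical helper
    return x_train, y_train, x_val, y_val

def create_model(x_train, y_train, x_val, y_val):
    model = Sequential()
    # the values inside {{...}} are the options Hyperas will search over
    model.add(LSTM({{choice([70, 100, 130])}},
                   input_shape=(x_train.shape[1], x_train.shape[2])))
    model.add(Dropout({{uniform(0.2, 0.5)}}))
    model.add(Dense(1))
    model.compile(loss="mean_squared_error", optimizer="adam")
    history = model.fit(x_train, y_train,
                        batch_size={{choice([20, 30, 40])}},
                        epochs=30,
                        validation_data=(x_val, y_val),
                        verbose=0)
    val_loss = min(history.history["val_loss"])
    return {"loss": val_loss, "status": STATUS_OK, "model": model}

best_run, best_model = optim.minimize(model=create_model,
                                      data=data,
                                      algo=tpe.suggest,
                                      max_evals=20,
                                      trials=Trials())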

But Hyperas didn't work in this case, since I was calling the 'data' function from the 'model' function and the double-curly-brace syntax caused some problems. I didn't dig into the issue as I already had the other tools. Nevertheless I think it's worth mentioning here, as it's a lot easier to use.

This is all fine, but how do we know that the result returned by any of these tools is the best? If you ask me, we won't know. You have to weigh the trade-off: how much time you can spend on fine-tuning versus what validation loss is good enough for you to move ahead. In my opinion it is best to spend some additional time hand-tuning around the results of the above tools.

For instance, when I was working on the stocks dataset I first wrote my own implementation of grid search and ran it in the cloud. Then I tried the above tools for smart tuning, but unfortunately the cloud VM was quite slow (slower than my laptop!) that day and I was running out of patience. Fortunately, by the time I had read about these tools, implemented them, and started running them in the cloud, my grid search was done. I had run grid search on these values:

search_params = {
    "batch_size": [20, 30, 40],
    "time_steps": [30, 60, 90],
    "lr": [0.01, 0.001, 0.0001],
    "epochs": [30, 50, 70]
}

This search, with only 4 parameters (3 values each, so 81 combinations), ran for 24 hours! I don't have the results to share here because, in my excitement, I forgot to implement logging (mistake 1). Using the best parameters from the grid search, I got a disappointing prediction:

Initial Result

But I had not optimized other things like the number of layers, etc. (I had started with a neural net having 2 LSTM layers and 1 dense layer — mistake 2. Always start with a simpler model, test the waters and then build your way up.) Anyway, I decided to optimize further manually. I took the best result from the grid search and tried other parameters. No matter how hard I tried, I couldn't improve the loss much. I knew it was overfitting, so I tried increasing the dropouts, but to no avail. Then I had an epiphany that maybe I was trying too hard and 1 LSTM layer was enough (yeah, I know I'm stupid — mistake 3, underestimating the power of neural networks). But how do you know if the model is overfitting? It's actually quite simple. In my case the training error vs validation error looked like this:

First, when you see a big gap between the loss on training data and validation data, it's overfitting. The logic behind this is that your model is learning very well from the training data but is unable to generalize to the validation data (new data). The second sign is the uncanny shape of the validation error plot: it's all over the place, which means the model is essentially predicting random values on new data, so there is almost no relation between the validation losses across epochs. Another thing to look for is the epoch logs: if the training loss keeps decreasing but the validation loss fluctuates or stays the same after some time, it's probably overfitting.
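If you want to check these signs yourself, the plot is easy to produce from the History object that Keras returns from fit(); a minimal sketch, with the model and data assumed from the earlier snippets:

import matplotlib.pyplot as plt

history = model.fit(x_train, y_train,
                    epochs=50,
                    batch_size=30,
                    validation_data=(x_val, y_val))

# a big, persistent gap between the two curves is the overfitting signal
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()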

So I removed the second LSTM layer and added a dropout layer with a higher value than I had been using (0.5 instead of 0.2); roughly, the change looks like the sketch below. And voila!
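(A sketch with illustrative layer sizes, not my exact final model; the complete code is on my GitHub.)

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(100, input_shape=(time_steps, n_features)))  # single LSTM layer now
model.add(Dropout(0.5))   # raised from 0.2 to fight the overfitting
model.add(Dense(1))
model.compile(loss="mean_squared_error", optimizer=Adam(lr=0.001))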

Final plot
Final train vs validation loss. Notice how the gap is gradually reducing.

Drastic improvement, right? I am pretty sure that with more effort we can make it even better.

I am not sharing a code snippet for training the final model, as it's the same as shown in the snippets above, only with the right parameters. You can find all the complete programs on my GitHub profile here.

In the next article I will share some important tools and tips that helped me a lot during this project and that usually don't get enough attention.

