
Learning Rates for Deep Learning Models

How to make good models great through optimization


Photo by Stephen Pedersen on Unsplash

Deep Learning models are incredibly flexible, but a great deal of care is required to make them effective. The choice of learning rate is crucial.

This article is the first in a series about fine-tuning deep learning models.

This article discusses the effects of the learning rate on the convergence and performance of deep learning models. A simple neural network is used to showcase the differences between learning rates and learning rate schedules.

Topics covered in this article:

  • The intuition behind different learning rates and optimization problems.
  • Learning Schedulers with Exponential Decay.
  • Cyclical Learning Rate Schedulers.
  • Custom Learning Rate Schedulers.

Despite the flexibility of deep learning models, creating a robust one requires careful thought about how you will train it. For example, a flexible, dynamic model may be capable of phenomenal performance, but those results are not realized without the proper time and care.

There are many possible ways to improve a deep learning model. These include the choice of activation function, learning rate, optimizer, batch size, weight initialization, and many other aspects of deep learning models.

While each choice is critically important to the overall model, the focus here is only on the choice of learning rate.

Learning Rate

Usually, when discussing learning rates, you’ll see an image like the following.

Learning rates: good (left) and too big (right) (Photo by Author)

The function is a perfect parabola, and things either converge nicely or spiral wildly out of control. Unfortunately, this example is far too simple to show the problems you actually face when optimizing a model.

Instead, let’s start without a mathematical representation of learning rates. Imagine that you are on a mountainside. Your goal: find the highest point of the highest mountain.

Optimization as a Landscape

The problem is that there is an incredible amount of fog. You cannot see the mountain peak. You don’t even know where you are on the mountainside. All you know is where you are standing. The most straightforward approach to finding the mountain peak would be to keep going up. Whenever you can, keep walking towards the higher ground.

But if the fog is too thick, you can’t even determine which way is up. This lack of orientation is why optimization techniques use gradients. A gradient gives you the slope at the point you are standing. So now, despite the fog, you know which way is up.

With your new knowledge of the direction to travel, you want to head that way. But how big of a step should you take? A large jump? Or a slight shuffle of your feet? Remember, the fog is so thick you don’t know where you’ll land. All you know is the slope of where you are right now.

Determining how to step

But you’re pretty confident you know the mountainside. So you leap. And suddenly, you find yourself at the bottom of a crevasse. You’ve stepped off a cliff, and now you are much further away from the peak of the mountain.

The more cautious readers might be thinking, ‘Let’s shuffle along, so that doesn’t happen.’ You’ve avoided the crevasse, but each step gets you nowhere fast. You shuffle along, and it takes forever to reach the top of the mountain. You might even give up along the way.

Suppose instead you stick to taking large jumps at each step, but this time there isn’t a large crevasse. You leap up the mountain in record time, drastically reducing the time it takes to reach the top.

These problems represent the trade-off between small and large learning rates. Do you want to risk falling off the mountainside, or do you want to risk taking forever to reach the top? Of course, both options have risks, but you get to choose.

Photo by Dylan Luder on Unsplash

Problems on the Mountain

Then there are scenarios where you’re stepping up the mountain and find yourself on a bump, a small peak, which makes matters worse. If your steps are too small, you think you’ve found the highest point. There aren’t any higher points around you as you shuffle back and forth, so for all you know, this is the highest peak. Taking larger steps sends you off this little bump and helps you explore more of the mountain.

But while large steps helped you off the lower peak (a local optimum), they have the same effect at the highest peak. They send you straight over the highest point. So you find yourself jumping back and forth over the peak but never going slowly enough to settle on the very top.

A mountain range is a simplified representation of optimization in deep learning, but the analogy reveals another problem. This problem becomes increasingly likely as we leave the real world of three dimensions and try to optimize deep learning models with potentially endless dimensions.

Endless Saddles

This issue is saddle points. Think of the saddle on a horse. On a mountain, this looks like a flat section that drops off on either side. The problem is the gradient: in these sections, the gradient is zero or close to zero. There is no slope, so you have no way to orient yourself. You get stuck.

Photo by Weigler Godoy on Unsplash

This lack of slope, or near-zero slope, is the problem with saddle points: you have no way to make progress. Fortunately, completely flat areas are uncommon on mountain ranges and in deep learning optimization; some degree of slope usually exists. But progress when you reach these areas is slow.

So these saddle points can cause the search for the highest mountain peak to stall out. While these areas are not too common in low dimensions, saddle points become increasingly likely as dimensionality grows. You are very likely to encounter a saddle point between a few of your n dimensions, which can cause learning to stall. However, there are ways to mitigate this.


Selecting a Learning Rate

These issues bring us to alternative approaches to learning rates. Throughout the previous section, you may have been thinking to yourself, ‘Why not use an adjustable learning rate? Sometimes I want to take a big step and sometimes a small step.’ This strategy is an improvement.

The remaining sections discuss the options for different learning rates. These options help mitigate the issues described above. However, as you will see in future posts, the choice of optimizer also plays a huge role. Many optimizers track additional statistics of past gradients, which incorporates concepts like momentum into the convergence of your model.

So what is a learning rate for model optimization?

A learning rate is the step size, the degree to which the model learns. Larger rates train the model faster but don’t allow the model to converge effectively to more robust solutions. Conversely, lower rates slow learning but let your deep learning models inch even closer to an optimal configuration.
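To make the idea concrete, here is a minimal sketch of the plain gradient descent update, where the gradient sets the direction and the learning rate scales the step. Note that in practice we minimize a loss, so the update steps downhill rather than up the mountain; the toy function and values below are purely illustrative.

```python
import numpy as np

def gradient_descent_step(weights, gradient, learning_rate):
    # Move against the gradient; the learning rate scales the size of the step
    return weights - learning_rate * gradient

# Toy example: minimizing f(w) = w**2, whose gradient is 2 * w
w = np.array([5.0])
for _ in range(3):
    w = gradient_descent_step(w, gradient=2 * w, learning_rate=0.1)
print(w)  # each update moves w closer to the minimum at 0
```

A larger learning rate in this sketch covers more ground per update but can overshoot the minimum, which is exactly the trade-off described in the mountain analogy.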

Optimization requires a learning rate, but choosing the learning rate is itself an optimization problem. There is no one-size-fits-all value, and the choice can affect the model’s overall performance.

Fortunately, machine learning packages such as TensorFlow already provide a lot of this functionality, making it incredibly easy to test different variations and implement different learning rates within your model.

The remainder of this article covers some alternatives to a fixed learning rate. While starting with a fixed rate or simply trying a few different values is fine, a more flexible solution can help prevent some of the problem scenarios discussed above.

Experiment Setup

For this example, I am using the diamonds dataset, which is available under a public domain license. The dataset consists of a combination of categorical and numeric variables; for this article, I’ve removed the categorical features.

Diamonds
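The original preprocessing code is not reproduced here, but a rough sketch of the setup might look like the following. It assumes the dataset is loaded through seaborn’s copy of diamonds, that price is the continuous target, and that a simple scikit-learn split is used; none of these specifics are confirmed by the article.

```python
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the diamonds dataset (via seaborn's copy) and keep only the numeric columns
diamonds = sns.load_dataset("diamonds")
numeric = diamonds.select_dtypes(include="number")

# Assume price is the continuous target; the remaining numeric columns are the features
X = numeric.drop(columns=["price"])
y = numeric["price"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```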

This dataset provides a good amount of data over many features, so it is unlikely that the loss surface is perfectly convex. This creates the scenario of a mountain range with many small peaks, strange bumps, and sudden drop-offs, which is ideal for comparing convergence under different learning rates.

The model used is a simple multi-layer neural network designed in TensorFlow. But as deep learning models rely on the same type of optimization, these experiments will be consistent across other deep learning models.

The target output is continuous, and the loss is set as the mean absolute error. The activation is a ReLU. The final model has two fully connected layers with fifty nodes per layer.

To visualize the results of the experiments, I am using the Plotly library, which I’ve used extensively in the past. The code to replicate the plots in this article is shown below; details regarding Plotly can be found in the linked article.

Automated Interactive Reports with Plotly and Python

Baseline Model

The first model acts as a baseline: the learning rate is fixed, and it is already relatively low. Note how the learning rate is set within the Adam optimizer.
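The original code is not shown here, but a sketch consistent with the description (two fully connected layers of fifty ReLU nodes, mean absolute error loss, and a fixed learning rate passed to Adam) might look like this. The specific rate and epoch count are assumptions, and the training data comes from the preprocessing sketch above.

```python
import tensorflow as tf

def build_model():
    # Two fully connected layers with fifty ReLU nodes each and a single continuous output
    return tf.keras.Sequential([
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

baseline = build_model()

# The fixed learning rate is set directly on the Adam optimizer
baseline.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # assumed value
    loss="mean_absolute_error",
)

history_baseline = baseline.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,  # assumed epoch count
    verbose=0,
)
```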

The training versus validation loss is shown below. The model converges steadily but over many epochs. The optimization may benefit from a higher learning rate and faster convergence, especially given more data.


Flexible Learning Rates and Schedulers

The following experiments use different learning rate schedulers. To apply these variable learning rates in TensorFlow, pass the scheduler directly to the optimizer. Each optimizer within TensorFlow accepts a learning rate scheduler in place of a constant learning rate, which is perfect for optimizing deep learning models.

Exponential Decay

The first alternative shown is exponential decay. In the early stages of training, the model is massively under-fitted, so a large learning rate is ideal for making big adjustments to the weights. The model will quickly adapt to the data but may predict quite generally at first, maintaining a high level of bias.

However, as training progresses, the steps get smaller and smaller. In the mountain analogy, you’ve found a high area and are slowly circling the peak. As the steps become smaller, your chances of jumping over the highest point are reduced.

Setting the learning rate to decay in this manner allows the model to settle nicely into a final state. It strikes a balance between large steps, which save training time and iterations, and small steps, which squeeze the last bits of performance out of your deep learning model.

Code for Exponential Decay

Exponential decay is a pre-packaged scheduler in TensorFlow. All that is required is specifying the initial learning rate, decay rate, and the number of steps for the decay. Then, the whole scheduler object is passed as the learning rate.
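A minimal sketch of this setup, with placeholder values rather than the article’s exact settings, might look like the following.

```python
import tensorflow as tf

# The learning rate shrinks by a factor of decay_rate every decay_steps optimizer steps
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,  # placeholder values, not the article's exact settings
    decay_steps=10_000,
    decay_rate=0.9,
)

exp_model = build_model()  # same architecture as the baseline sketch

# The whole scheduler object is passed as the learning rate
exp_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss="mean_absolute_error",
)

history_exp = exp_model.fit(
    X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=0
)
```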

Experiment

While the results are similar to the constant rate, on review you can see that the model settled at a slightly lower loss. This change squeezes more performance out of the same data and model; only the learning rate changed during training.

Cyclical Scheduler

While the gradual decrease in step size provides a trade-off between fast learning and effective convergence, there is still the issue of local optima. Having your deep learning model settle into a local optimum with a higher optimum close by means you’re leaving performance on the table.

Think of a collection of peaks tightly clustered together. Of course, the highest peak is in the mix, but you’ve just selected the first one you encountered.

Instead, the cyclical learning rate approach attempts to disrupt the convergence of the optimization process.

The strategy consists of starting with a low learning rate, slowly increasing it, then dropping back down to a low learning rate and increasing again.

If your deep learning model is already at a stable optimum, these large steps late in the optimization will have little effect. If the optimum is unstable, meaning there are other optima around it, then the large steps help the optimization explore them.

Code for Cyclical Scheduler

The cyclical scheduler is part of the TensorFlow Addons package, a must-know package if you build deep learning models in TensorFlow and one worth an entire article of its own.
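A sketch of the setup might look like the following. The values are placeholders, and the scale_fn shown here gives the simple triangular policy; note that TensorFlow Addons is now in maintenance mode, so check compatibility with your TensorFlow version.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # pip install tensorflow-addons

# The learning rate oscillates between a low and a high value; placeholder settings
cyclical_lr = tfa.optimizers.CyclicalLearningRate(
    initial_learning_rate=1e-4,
    maximal_learning_rate=1e-2,
    step_size=2000,          # number of optimizer steps in half a cycle
    scale_fn=lambda x: 1.0,  # constant amplitude every cycle (triangular policy)
)

cyc_model = build_model()  # same architecture as the baseline sketch
cyc_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=cyclical_lr),
    loss="mean_absolute_error",
)

history_cyclical = cyc_model.fit(
    X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=0
)
```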

Experiment

With this new learning scheduler, the model improves even more. There are also distinct points in training when the learning rate jumps; these appear as sudden dips in the loss, improving the model overall.

Custom Scheduler

Despite the different options available for adjusting learning rates, you may want even more control over the optimization.

Fortunately, there are options available to create the exact learning scheduler that you want.

With the standard exponential decay scheduler, the learning rate decreases at a consistent pace. However, you may not want the decay to be consistent: as your deep learning model settles, an even smaller learning rate may be more appropriate.

In these cases, squeezing the last bit of performance out of your deep learning model involves small tweaks to the weights only in the final few iterations and epochs of training.

Custom Scheduler Code

For the custom scheduler, the setup is slightly different. In contrast to the previous methods, a custom learning rate change is accomplished using a callback during training. The callback accepts both the epoch and the current learning rate as parameters and makes custom adjustments.

For this example, I am manually overriding the learning rate depending on the epoch. The changes are arbitrary but show how the model’s loss behaves at different learning rates.
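A sketch of such a callback might look like the following. The epoch cut-offs and rates are made up for illustration and only mirror the behaviour described in the experiment: a moderate rate, then the lowest rate, then a deliberately high rate.

```python
import tensorflow as tf

def custom_schedule(epoch, lr):
    # Hand-picked, arbitrary rates per epoch range, purely for illustration
    if epoch < 20:
        return 1e-3
    elif epoch < 60:
        return 1e-4  # the lowest rate, where the loss settles
    else:
        return 1e-2  # deliberately high again, to show the effect of a too-large rate

custom_model = build_model()  # same architecture as the baseline sketch
custom_model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mean_absolute_error")

history_custom = custom_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100, verbose=0,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(custom_schedule)],
)
```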

Experiment

Strangely enough, the custom learning rate produced the model with the lowest loss. It also showcases the problems with a learning rate that is too high.

When the learning rate is at its lowest, the model shows the best performance. However, at this stage, the validation loss fluctuates much more, showing some instability in this optimal state.

When the learning rate jumps back up, the immediate disadvantages are clear. The model is performing far worse than before and cannot converge even with many more epochs.

With the flexibility of custom schedulers, you can build further on the benefits of the exponential decay or cyclical schedulers. One option is to apply the same cyclical method but update the maximum and initial rates as the model progresses over more epochs.

Comparison of Methods

I’ve included a side-by-side comparison of the methods for you to review. To recreate the plot, run the code below.
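The original plotting code is not reproduced here, but a rough Plotly sketch of a side-by-side comparison with a shared y-axis might look like the following. The history_* variables are placeholders for the History objects returned by each training run sketched earlier.

```python
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# History objects returned by the training runs sketched above
histories = {
    "Constant": history_baseline,
    "Exponential decay": history_exp,
    "Cyclical": history_cyclical,
    "Custom": history_custom,
}

fig = make_subplots(
    rows=1, cols=len(histories),
    subplot_titles=list(histories.keys()),
    shared_yaxes=True,  # same y-axis scale makes the small differences visible
)

for col, (name, hist) in enumerate(histories.items(), start=1):
    fig.add_trace(go.Scatter(y=hist.history["loss"], name=f"{name} train"), row=1, col=col)
    fig.add_trace(go.Scatter(y=hist.history["val_loss"], name=f"{name} validation"), row=1, col=col)

fig.update_yaxes(title_text="Mean absolute error", row=1, col=1)
fig.update_xaxes(title_text="Epoch")
fig.show()
```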

With the same y-axis scale, the slight differences in models are more apparent.

While these margins are relatively small, in large deep learning models, these small improvements can mean significant changes to the model’s effectiveness.


Conclusion

Learning rates are a critical aspect of training your deep learning models. All the variations of deep learning and machine learning are based on this concept of optimization.

And at the core of the standard optimization methods is the learning rate.

Far too often, model developers rely on fixed learning rates, missing out on better models. Trying a few different learning rates is a slight improvement, but it is still limiting, especially when you consider all of the issues that may occur.

Take the time to observe how your deep learning models learn and how their performance changes over many iterations. Understand what different learning rates mean for your models, and see for yourself how much more you can get out of them with intelligent design choices.


If you’re interested in reading articles about novel data science tools and understanding machine learning algorithms, consider following me on medium. I always include code in my articles that you can apply to your work!

If you’re interested in my writing and want to support me directly, please subscribe through the following link. This link ensures that I will receive a portion of your membership fees.

Join Medium with my referral link – Zachary Warnes

