
Optimization Methods in Deep Learning

Breaking Down the Fundamentals

In deep learning, the weights are generally optimized with gradient descent, which approaches the optimal values by running many epochs over large datasets. The process is computationally intensive, and more often than not you need to fit a dataset multiple times, so an efficient and fast optimization method that gets you to the optimum more quickly is of great importance.

In this article, we will go through some general techniques used in the optimization process.

Photo by Tim Carey on Unsplash

Weighted Average

First, let's look at the idea of the weighted average: it smooths a series by assigning weights to past values.

v_t = β·v_(t−1) + (1 − β)·θ_t

where t denotes the current time step, t − 1 the previous one, and θ_t is the current value.

If you substitute the equation into itself recursively, you get:

v_t = (1 − β)·θ_t + (1 − β)·β·θ_(t−1) + (1 − β)·β²·θ_(t−2) + …

You can see that v_t is a weighted average of the past values, with weights that decay geometrically the further back you go. To make this concrete, let's generate some noisy data points and smooth them. (The full implementation is in my Github Repo.)
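As a rough sketch (not the repo code itself), the smoothing can be written in a few lines of NumPy; the noisy sine series below is only an illustrative stand-in for the data:

```python
import numpy as np

np.random.seed(0)
# Noisy sine data as an illustrative stand-in for the article's random points.
theta = np.sin(np.linspace(0, 4 * np.pi, 200)) + np.random.normal(0, 0.3, 200)

def weighted_average(theta, beta):
    """Compute v_t = beta * v_(t-1) + (1 - beta) * theta_t, starting from v_0 = 0."""
    v, out = 0.0, []
    for x in theta:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

smooth_90 = weighted_average(theta, beta=0.9)  # smoother: roughly averages the last 10 points
smooth_50 = weighted_average(theta, beta=0.5)  # follows the raw series more closely
```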

We can see that the higher β is, the smoother the line becomes. In fact, the number of past values that are effectively averaged is roughly 1/(1 − β).

This means that with β = 0.9, the average roughly covers the past 10 values.

And this weighted average is the technique used in our first optimization method: momentum.

Momentum

Gradient descent with momentum borrows the idea of the weighted average; the update equations are shown below:

v_dW = β·v_dW + (1 − β)·dW
v_db = β·v_db + (1 − β)·db
W = W − α·v_dW
b = b − α·v_db

Here v is initialized to 0, α is the learning rate, and each v has the same shape as the corresponding dW or db.

One problem with raw gradient descent is the oscillation involved in the optimization process. As the sine example above shows, the original series oscillates violently, while the weighted average smooths the oscillation out. The same intuition applies to gradient descent with momentum: in the high-dimensional parameter space, oscillation along irrelevant directions is smoothed out, and the descent toward the minimum is accelerated.
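A minimal sketch of a single momentum update might look like the following, assuming dW and db are gradients computed elsewhere (e.g. by backpropagation); the function name and hyperparameter defaults are illustrative:

```python
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, beta=0.9, lr=0.01):
    """One gradient-descent-with-momentum step; v_dW and v_db start as zero arrays shaped like W and b."""
    v_dW = beta * v_dW + (1 - beta) * dW   # weighted average of past dW
    v_db = beta * v_db + (1 - beta) * db   # weighted average of past db
    W = W - lr * v_dW                      # move along the smoothed gradient
    b = b - lr * v_db
    return W, b, v_dW, v_db
```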

RMSProp

RMSProp makes use of a weighted average of the squared past gradients:

S_dW = β·S_dW + (1 − β)·dW²
S_db = β·S_db + (1 − β)·db²
W = W − α·dW / (√S_dW + ϵ)
b = b − α·db / (√S_db + ϵ)

Here S is initialized to 0, and ϵ is a very small value that avoids division by zero. An intuitive way to read this: when the accumulated squared gradients are large, the division makes the update small, as if friction were added to a ball rolling downhill to control its speed. In particular, as S grows, the steps taken around the optimum become smaller and smaller, which makes it easier to settle there.
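A corresponding sketch of one RMSProp step, again with illustrative names and defaults:

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, beta=0.9, lr=0.01, eps=1e-8):
    """One RMSProp step; s_dW and s_db start as zero arrays shaped like W and b."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2   # weighted average of squared gradients
    s_db = beta * s_db + (1 - beta) * db ** 2
    W = W - lr * dW / (np.sqrt(s_dW) + eps)     # large s_dW -> smaller effective step
    b = b - lr * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
```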

Adam

Adam, currently one of the most widely used and effective optimization methods, combines the traits of both RMSProp and momentum, which makes the update equations a little more complex:

V_dW = β_1·V_dW + (1 − β_1)·dW
S_dW = β_2·S_dW + (1 − β_2)·dW²
V_dW_corrected = V_dW / (1 − β_1^t)
S_dW_corrected = S_dW / (1 − β_2^t)
W = W − α·V_dW_corrected / (√S_dW_corrected + ϵ)

(The updates for b are analogous, using db.)

Compared with RMSProp, the numerators dW and db are replaced with the bias-corrected momentum terms. The correction mainly takes effect in the initial phase: since β_1 and β_2 are less than 1, β_1^t and β_2^t are close to 1 when t is small, so dividing by (1 − β^t) scales the averages up; without this correction, V and S would stay very small at the beginning because they start from zero.
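A sketch of one Adam step for a single parameter W (the update for b has the same form); the function name and hyperparameter defaults are illustrative:

```python
import numpy as np

def adam_step(W, dW, v, s, t, beta1=0.9, beta2=0.999, lr=0.001, eps=1e-8):
    """One Adam step for W; v and s start as zero arrays shaped like W, and t starts at 1."""
    v = beta1 * v + (1 - beta1) * dW            # momentum term
    s = beta2 * s + (1 - beta2) * dW ** 2       # RMSProp term
    v_hat = v / (1 - beta1 ** t)                # bias correction, largest effect at small t
    s_hat = s / (1 - beta2 ** t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```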

When implementing these methods, note that S and V share the same shape as the corresponding W or b. All the methods introduced here are memory-efficient: at each step t, only the current values of V and S are stored, and no other intermediate values are required.
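For instance, the persistent state can be allocated with the same shapes as the parameters (a hypothetical setup, not the repo code):

```python
import numpy as np

W, b = np.random.randn(4, 3), np.zeros(3)
v_dW, s_dW = np.zeros_like(W), np.zeros_like(W)   # same shape as W
v_db, s_db = np.zeros_like(b), np.zeros_like(b)   # same shape as b
# These arrays are the only state carried from step to step.
```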

