[ Back to Basics ] Deriving Back Propagation on simple RNN/LSTM (feat. Aidan Gomez)

Jae Duk Seo
Towards Data Science
10 min read · May 2, 2018



Aidan Gomez did an amazing job explaining in detail how back-propagation works in an LSTM. I wrote my own explanation in this post as well. However, my friend Abe Kang had one question regarding my post.

So for today, I wanted to review some fundamental concepts of back propagation while making the connection between my post and Aidan's post.

Finally, all of the labs at my school are being renovated, so I don't have a whiteboard to write on. I'll do my best to write as neatly as possible on my notepad.

Fast Recap of Derivatives

Just to start off, let's practice some derivatives before we move on. If everything in the above image makes sense to you, feel free to move on; otherwise, I recommend reviewing derivatives by reading this post.

Also, please note that the derivative of f(x) = tanh(x) is 1 - tanh(x)², which can be rewritten as 1 - f(x)².
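To double check this identity, here is a quick NumPy snippet (not part of the original notes, just a sanity check) that compares the rewritten form 1 - f(x)² against a finite-difference estimate of the derivative:

```python
import numpy as np

# Sanity check: d/dx tanh(x) should equal 1 - tanh(x)^2
f = np.tanh
x = 0.7                                             # arbitrary test point
eps = 1e-5

numerical = (f(x + eps) - f(x - eps)) / (2 * eps)   # central finite difference
analytical = 1 - f(x) ** 2                          # the rewritten form 1 - f(x)^2

print(numerical, analytical)   # both come out around 0.6347
```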

Fast Recap of Chain Rule

Image from this website

The chain rule is super simple as well. Above, we want to take the derivative of the function y with respect to x. However, y is not written directly in terms of x, so we cannot differentiate it with respect to x right away. Thankfully, y is a function of u, and u is a function of x, so thanks to the chain rule we can still take the derivative of y with respect to x.
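As a tiny made-up example (the functions below are my own, not the ones in the image), take y = u² with u = 3x + 1. The chain rule gives dy/dx = (dy/du)(du/dx) = 2u · 3, and a finite-difference check agrees:

```python
# Hypothetical example: y = u^2 where u = 3x + 1
def u(x):
    return 3 * x + 1

def y(x):
    return u(x) ** 2

x = 0.5
eps = 1e-6

numerical = (y(x + eps) - y(x - eps)) / (2 * eps)   # direct finite difference
chain_rule = 2 * u(x) * 3                           # (dy/du) * (du/dx)

print(numerical, chain_rule)   # both ~15.0
```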

Fast Recap of Multi-variable Derivative

Blue → Derivative Respect to variable x
Red → Derivative Respect to variable Out

Now let's review derivatives with multiple variables: we simply take the derivative with respect to each variable independently. Also, for now, please ignore the names of the variables (e.g. x or out); they carry no special meaning. Let's see another example of this.

Left Image from this website, Right Image from this website

Since the derivative of ln(x) is 1/x, taking the partial derivative with respect to x and y independently for each equation gives us something like the above. Finally, let's look at an example of a multi-variable function inside a logistic sigmoid function (a quick numerical check follows the annotations below).

Blue → Derivative Respect to variable x
Red → Derivative Respect to variable Out
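Since the exact inner function in the image is not reproduced here, take sigmoid(x · out) as a stand-in multi-variable example. The partial derivatives follow the same pattern: sigmoid'(z) = sigmoid(z)(1 - sigmoid(z)) times the partial derivative of the inner function.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Stand-in multi-variable example: output = sigmoid(x * out)
x, out = 1.2, -0.4
s = sigmoid(x * out)

d_dx   = s * (1 - s) * out   # partial w.r.t. x (treat out as a constant)
d_dout = s * (1 - s) * x     # partial w.r.t. out (treat x as a constant)

# Finite-difference check
eps = 1e-6
num_dx   = (sigmoid((x + eps) * out) - sigmoid((x - eps) * out)) / (2 * eps)
num_dout = (sigmoid(x * (out + eps)) - sigmoid(x * (out - eps))) / (2 * eps)

print(d_dx, num_dx)       # should match closely
print(d_dout, num_dout)
```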

Basic Building Block of Recurrent Neural Network

Modified Image from this website

The above image shows the basic building blocks of a simple recurrent neural network with four states (denoted by the symbol s), three inputs (denoted by the symbol x), and one error (denoted by the symbol E). Now let's put that into a mathematical equation, but without any activation function.
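For reference, here is roughly what those equations look like written out. This is only a sketch: I am assuming an input weight W alongside the recurrent weight U from the figure, and the L2 error mentioned below.

```latex
\begin{aligned}
s_1 &= x_1 W + s_0 U \\
s_2 &= x_2 W + s_1 U \\
s_3 &= x_3 W + s_2 U \\
E   &= \tfrac{1}{2}(s_3 - y)^2 \quad \text{(L2 loss, no activation function)}
\end{aligned}
```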

Green Box → Derivative of Error Respect to State 3
Blue Box → Derivative of Error Respect to State 2
Red Box → Derivative of Error Respect to State 1

Now let's take a look at the derivative with respect to each state. Pretty simple, right? Using the chain rule as well as multi-variable derivatives, we can take the derivative with respect to each state in order to update the weight (U) accordingly. (Please note that I didn't take the derivative with respect to state 0, and that we are using the L2 loss function.) Now let's finish the task by taking the derivative with respect to the inputs (both sets of gradients are written out in the sketch after the annotations below).

Green Box → Derivative of Error Respect to Input 3
Blue Box → Derivative of Error Respect to Input 2
Red Box → Derivative of Error Respect to Input 1
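Written out under the same assumptions as the sketch above, the boxed chain-rule expressions look roughly like this:

```latex
\begin{aligned}
\frac{\partial E}{\partial s_3} &= (s_3 - y), \qquad
\frac{\partial E}{\partial s_2} = \frac{\partial E}{\partial s_3}\frac{\partial s_3}{\partial s_2} = (s_3 - y)\,U, \qquad
\frac{\partial E}{\partial s_1} = \frac{\partial E}{\partial s_2}\frac{\partial s_2}{\partial s_1} = (s_3 - y)\,U^2 \\[4pt]
\frac{\partial E}{\partial x_3} &= \frac{\partial E}{\partial s_3}\,W, \qquad
\frac{\partial E}{\partial x_2} = \frac{\partial E}{\partial s_2}\,W, \qquad
\frac{\partial E}{\partial x_1} = \frac{\partial E}{\partial s_1}\,W
\end{aligned}
```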

Feed-Forward operation of LSTM

Right Image from this website

The feed-forward operation looks very complicated; however, when we do the actual math it is quite simple. The left image is the graphical representation of the LSTM and the right image is the mathematical representation from Aidan Gomez. Now let's actually write down the math for states 1 and 2. (Please note that I use the terms 'state' and 'time stamp' interchangeably in this post.) I'll use Aidan's notation, since it makes things easier to understand.
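As a rough sketch of the feed-forward math in something close to Aidan's notation (the weight names W, U and b below are placeholders for the per-gate parameters, and biases are included here even though my hand-written notes drop them for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, out_prev, state_prev, W, U, b):
    """One LSTM feed-forward step, roughly in Aidan's notation (a sketch)."""
    a_t = np.tanh(W['a'] @ x_t + U['a'] @ out_prev + b['a'])  # candidate activation
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ out_prev + b['i'])  # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ out_prev + b['f'])  # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ out_prev + b['o'])  # output gate
    state_t = a_t * i_t + f_t * state_prev                    # new cell state
    out_t = np.tanh(state_t) * o_t                            # new output
    return out_t, state_t

# Tiny usage example with made-up sizes and random weights
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(2, 1)) for k in 'aifo'}
U = {k: rng.normal(size=(2, 2)) for k in 'aifo'}
b = {k: np.zeros(2) for k in 'aifo'}

out1, state1 = lstm_step(np.array([1.0]), np.zeros(2), np.zeros(2), W, U, b)
out2, state2 = lstm_step(np.array([0.5]), out1, state1, W, U, b)
```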

Again, please note that we are using the L2 cost function. Now let's look at a numerical example of this; the two images below are from Aidan's blog post. (Note that in my notes I did not write the bias terms, for simplicity.)

Also, Aidan used states 0 and 1, while I used states 1 and 2.

Back Propagation at Time Stamp 2

Left Image → Back-propagation equations by Aidan

There are two things to note here….

1. Green Line → Remember that the derivative of tanh() can be rewritten as shown. If you don't remember why, please scroll up.

2. I did not take the derivative with respect to i(2) and f(2). The reason is that those two terms have a very similar derivative structure to a(2). I'll try to explain why in more detail below.

If we observe how state(t) is calculated, we can see that the terms a(t), i(t), f(t) and state(t-1) are involved. So when we take the derivative with respect to a(t), we know it will be very similar to taking the derivative with respect to i(t) or f(t). However, there is one term we need to look at more closely, and that is o(t). Since that term is used after we calculate state(t), its derivative is different.

Orange Box → Derivative respect to o(2)

We can clearly see that this derivative differs from the one for a(2).
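To make that difference concrete, here is roughly how the gate gradients at time stamp 2 look in Aidan's notation (treat this as a sketch and consult his post for the exact bookkeeping; δ denotes the gradient of the error with respect to each term):

```latex
\begin{aligned}
\delta a_2 &= \delta \mathrm{state}_2 \odot i_2 \odot \bigl(1 - a_2^2\bigr) \\
\delta i_2 &= \delta \mathrm{state}_2 \odot a_2 \odot i_2 \odot (1 - i_2) \\
\delta f_2 &= \delta \mathrm{state}_2 \odot \mathrm{state}_1 \odot f_2 \odot (1 - f_2) \\
\delta o_2 &= \delta \mathrm{out}_2 \odot \tanh(\mathrm{state}_2) \odot o_2 \odot (1 - o_2)
\end{aligned}
```

The first three all branch off δstate(2), which is why their structure is so similar, while δo(2) branches off δout(2) instead, because o(2) is applied after state(2) has already been formed.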

With those ideas in mind, we can see that deriving back-propagation at time stamp 2 is not that hard, since it is the outermost layer of our LSTM.

Back Propagation at Time Stamp 1

Green Box → Derivative Portion Directly from Error Function at Time Stamp 1
Blue Box → Derivative Portion from Time Stamp 2
Red Box → Summarizing the Symbol to Beta

The above image shows the back-propagation operation at time stamp 1. Again, I am using Aidan's notation; however, there is one portion that my friend pointed out.

Blue Box → Why do we need a term from t+1 while calculating the derivative at t?

It definitely seems confusing at first sight, so let's take a deeper look at the feed-forward operation.

Blue Box → Where State 1 shows up when the time stamp is 1 (Left Image) and when the time stamp is 2 (Right Image)

Please note again that I am using the terms 'time stamp' and 'state' interchangeably. We can observe that when taking the derivative with respect to state 1, there are two pieces we need:

  1. First, the gradient when the time stamp is 1 (Green Box)
  2. Second, the gradient when the time stamp is 2 (Blue Box)

Let's recap where the blue box terms come from (the derivative with respect to state(2)).

After taking the above derivative, we still need to include the f(2) term. The reason is explained below.

Remember that during the feed-forward operation at time stamp 2, we multiplied state(1) by f(2). Hence, when we take the derivative with respect to state(1), we need to multiply by that term as well.
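Putting that in one line (still a sketch in Aidan's notation): the gradient flowing into state(1) has a local piece from time stamp 1 plus the piece carried back from time stamp 2, scaled by f(2) because state(1) entered state(2) through the product f(2) ⊙ state(1).

```latex
\delta \mathrm{state}_1 \;=\; \delta \mathrm{out}_1 \odot o_1 \odot \bigl(1 - \tanh^2(\mathrm{state}_1)\bigr) \;+\; \delta \mathrm{state}_2 \odot f_2
```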

Back Propagation with Respect to O at Time Stamp 1

Orange Box → Derivative respect to term O

Now let's think about the back-propagation process with respect to the term O. Compared to the back-propagation we did in the prior section, there is one interesting fact to take note of.

Orange Box → Aidan’s Equation for Back propagation

There is no term involving t+1. That is one clear difference compared to taking the gradient with respect to state(t). In other words…

a) While taking the gradient with respect to state(t) → WE NEED TO CONSIDER THE GRADIENT FROM THE FUTURE TIME STAMP STATE(t+1)

b) While taking the gradient with respect to o(t) → WE DO NOT NEED TO CONSIDER THE GRADIENT FROM THE FUTURE TIME STAMP STATE(t+1)
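Side by side, in the sketched notation from above:

```latex
\begin{aligned}
\delta \mathrm{state}_t &= \delta \mathrm{out}_t \odot o_t \odot \bigl(1 - \tanh^2(\mathrm{state}_t)\bigr) \;+\; \underbrace{\delta \mathrm{state}_{t+1} \odot f_{t+1}}_{\text{gradient from the future time stamp}} \\
\delta o_t &= \delta \mathrm{out}_t \odot \tanh(\mathrm{state}_t) \odot o_t \odot (1 - o_t) \qquad (\text{no } t{+}1 \text{ term})
\end{aligned}
```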

This can be seen again when doing back-propagation with respect to time stamp 1.

Blue Box → Where State(2) (in other words, State(t+1)) appears
Yellow Box → Derivatives that are affected by State(t+1)
Black Box → Derivatives that are not affected by State(t+1)

To further confirm this idea lets take a look at my previous blog post, where I was more detailed with the back propagation.

Image from this post

Blue Box → Derivative respect to o (denoted by W(o)) from the cost function at time stamp 1
Green Box → Derivative respect to o (denoted by W(o)) from the cost function at time stamp 2

So far so good. We can understand the derivatives, but now let's actually take a look at the math.

Please ignore the green/red boxes for now. Okay, that's the math we have; now let's compare the equation from the previous post to the one from our current post. To compare things easily, let's expand the equation, as seen below.

Red Box → Repeated terms from expanding the equation
Green Box → Derivative from time stamp 2

Now to put things all together: all of the green box elements we saw above go into the black star variable, and all of the red box elements are exactly the same (except for the input x symbols).

And what we can confirm here is that when taking the derivative with respect to the o term (seen below), we do not need to consider the derivative from state(t+1).

Final words

I hope this post clears up some confusion; however, I know my explanation in English isn't the best. So if you have any questions, please comment down below.

If you find any errors, please email me at jae.duk.seo@gmail.com; if you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on Twitter here, and visit my website or my YouTube channel for more content.

Reference

  1. Backpropogating an LSTM: A Numerical Example — Aidan Gomez — Medium. (2016). Medium. Retrieved 1 May 2018, from https://medium.com/@aidangomez/let-s-do-this-f9b699de31d9
  2. Calculus Review: Derivative Rules — Magoosh High School Blog. (2017). Magoosh High School Blog. Retrieved 1 May 2018, from https://magoosh.com/hs/ap-calculus/2017/calculus-review-derivative-rules/
  3. Rules of calculus — multivariate. (2018). Columbia.edu. Retrieved 1 May 2018, from http://www.columbia.edu/itc/sipa/math/calc_rules_multivar.html
  4. The derivative of lnx and examples — MathBootCamps. (2016). MathBootCamps. Retrieved 1 May 2018, from https://www.mathbootcamps.com/derivative-natural-log-lnx/
  5. Electricity price forecasting with Recurrent Neural Networks. (2018). Slideshare.net. Retrieved 1 May 2018, from https://www.slideshare.net/TaegyunJeon1/electricity-price-forecasting-with-recurrent-neural-networks
  6. Differentiation 3 Basic Rules of Differentiation The Product and Quotient Rules The Chain Rule Marginal Functions in Economics Higher-Order Derivatives. — ppt download. (2018). Slideplayer.com. Retrieved 1 May 2018, from http://slideplayer.com/slide/10776187/
  7. Backpropogating an LSTM: A Numerical Example — Aidan Gomez — Medium. (2016). Medium. Retrieved 2 May 2018, from https://medium.com/@aidangomez/let-s-do-this-f9b699de31d9
  8. Only Numpy: Deriving Forward feed and Back Propagation in Long Short Term Memory (LSTM) part 1. (2018). Becoming Human: Artificial Intelligence Magazine. Retrieved 2 May 2018, from https://becominghuman.ai/only-numpy-deriving-forward-feed-and-back-propagation-in-long-short-term-memory-lstm-part-1-4ee82c14a652

