
3 Deep Learning Algorithms in under 5 minutes – Part 2 (Deep Sequential Models)

No traces of intimidating linear algebra detected…

Image by Erik Stein from Pixabay

In the last article, we looked at models that deal with non-time-series data. Time to turn our attention to some other models. Here we will be discussing deep sequential models, which are predominantly used to process and predict time-series data.

Link to Part 1, in case you missed it.

Simple Recurrent Neural Networks (RNNs)/Elman Networks

Simple recurrent neural networks (also referred to simply as RNNs) are to time-series problems what CNNs are to computer vision. In a time-series problem, you feed a sequence of values to a model and ask it to predict the next n values of that sequence. RNNs go through each value of the sequence while building up a memory of what they have seen, which helps them predict what the future will look like. (Learn more about RNNs [1] [2])

Analogy: New and improved secret train

I played this game as a kid, and you might know it by a different name. Kids stand in a line and you whisper a random word to the first kid in the line. That kid adds an appropriate word and whispers the result to the next kid, and so on. By the time the message reaches the last kid, you should have an exciting story brewed up by the kids’ imagination.

Enter simple RNNs! This is the crux of an RNN. It takes some input at time t – x(t) (the new word from the last kid) and a state from time t-1 – h(t-1) (the previous words of the message) as inputs and produces an output – y(t) (the previous message + the new word from the last kid + your new word).

Once you train an RNN, you can (but generally won’t) keep predicting forever, because the prediction at time t (i.e. y(t)) becomes the input at t+1 (i.e. y(t)=x(t+1)). Here’s what an RNN looks like in the real world.

How an RNN works in a sentiment analysis problem. It goes from one word to the next while producing a state (red ball). Finally, there’s a fully-connected network (FCN) that takes the last state and produces a label (positive/negative/neutral).
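
If you’d like to see this in code, here’s a minimal sketch of such a sentiment classifier, assuming the TensorFlow/Keras API. The vocabulary size, sequence length and layer sizes are made-up illustrative values, not anything from a real dataset.

```python
import tensorflow as tf

vocab_size = 10000  # hypothetical vocabulary size
seq_len = 50        # hypothetical (padded) review length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    # Map each word ID to a dense vector
    tf.keras.layers.Embedding(vocab_size, 32),
    # The RNN walks the sequence word by word and returns its final state (the red ball)
    tf.keras.layers.SimpleRNN(64),
    # The fully-connected network that turns the last state into a label
    tf.keras.layers.Dense(3, activation="softmax"),  # positive/negative/neutral
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```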

Applications

  • Time series prediction (e.g. weather / sales predictions)
  • Sentiment analysis – Given a movie/product review (a sequence of words), predict whether it’s negative/positive/neutral.
  • Language modelling – Given part of a story, imagine the rest of the story / generate code from a description
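
To make the “keep predicting forever” idea from earlier concrete, here’s a rough sketch of one-step-ahead forecasting (the first application above) where each prediction is fed back as the newest input. Everything here is made up purely for illustration: the window size, the sine-wave seed and the untrained toy model.

```python
import numpy as np
import tensorflow as tf

window = 10  # hypothetical number of past values the model looks at

# A tiny, untrained toy forecaster: `window` past values in, one value out
model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, 1)),
    tf.keras.layers.SimpleRNN(16),
    tf.keras.layers.Dense(1),
])

history = list(np.sin(np.linspace(0, 6, window)))  # made-up seed sequence

# y(t) becomes x(t+1): each prediction is appended and fed back in
for _ in range(5):
    x = np.array(history[-window:]).reshape(1, window, 1)
    y = float(model.predict(x, verbose=0)[0, 0])
    history.append(y)

print(history[-5:])  # 5 "forecasted" values (nonsense until the model is trained)
```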

Long Short-Term Memory Networks (LSTMs)

The LSTM is the cool new kid in RNN-ville. It is a more complicated beast than the RNN and can remember things for longer. LSTMs also go through each value of the sequence while building up a memory of what they have seen, which helps them predict what the future will look like. But remember how RNNs had a single state (that represented memory)? LSTMs have two states (one long-term and one short-term), thus the name LSTM. (Learn more: LSTMs)

Analogy: Fast-food chain

All this explaining is making me hungry! So let’s go to a fast-food chain. This is a literal chain because, if you order a meal, one shop makes the burger, another the chips, and so on. In this fast-food drive-through, you go to the first shop and say the following.

I need a burger with a toasted tiger bun and grilled chicken.

There’s one person who takes the order (green) and sends that information to the red person; let’s say the red person toasts the bun. When communicating with the blue person, he can drop the “toasted” part and say,

a burger with a tiger bun and grilled chicken

(we still need the grilled part because the next shop decides the sauce based on that). Then you drive to the next shop and say,

Add cheddar, large chips and I’m wearing a green t-shirt

Now, the green person knows the t-shirt color is completely irrelevant and drops that part. The shop also gets information from the red and blue people of the previous shop. Next, they add the sauce and prepare the chips. The red person in the second shop will hold most of the order instructions, in case we need them later (e.g. if the customer complains). But he’ll only say,

A burger and large chips

to the blue person as that’s all he needs to do his job. And finally, you get your order from the output terminal of the second shop.

The fast-food chain. There are three people: green (input), red (cell state) and blue (output state). They can discard certain information from the inputs you provide, as well as discard information while processing them internally.

LSTMs are not far from how this chain operates. At a given time t, an LSTM takes,

  • an input x(t) (the customer in the example),
  • an output state h(t-1) (the blue person from the previous shop) and
  • a cell state c(t-1) (the red person from the previous shop).

and produces,

  • an output state h(t) (the blue person in this shop) and
  • a cell state c(t) (the red person in this shop)

But rather than doing direct computations on these elements, the LSTM has a gating mechanism that it uses to decide how much information from these elements is allowed to flow through. For example, remember what happened when the customer said “I’m wearing a green t-shirt” at the second shop: the green person (the input gate) dropped that information because it wasn’t important for the order. Another example is when the red person in the first shop drops the fact that the bun is toasted. There are several gates in an LSTM cell, namely:

  • An input gate (the green person) – discards information that’s not useful in the input.
  • A forget gate (part of the red person) – discards information that’s not useful in the previous cell state.
  • An output gate (part of the blue person) – discards information from the cell state that’s not useful for generating the output state.

As you can see, the interactions are complicated. But the main takeaway is that,

An LSTM maintains two states (an output/short-term state and a cell/long-term state) and uses gating to discard information when computing final and interim outputs.

Here’s what an LSTM would look like.

LSTM in the real world. You can see that it’s a complex labyrinth of connections. So don’t try to understand how they all connect at this point; just understand the various entities involved. The red dashed ball represents an interim output computed by the LSTM cell.
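
If you want to see the two states in code, here’s a minimal sketch assuming the Keras API: with return_state=True an LSTM layer hands back its output along with the output (short-term) state h and the cell (long-term) state c. The batch size, sequence length and layer size are made up.

```python
import tensorflow as tf

# A batch of 2 hypothetical sequences, 5 time steps, 8 features each
x = tf.random.normal((2, 5, 8))

lstm = tf.keras.layers.LSTM(16, return_state=True)

# `output` equals `h` here; h is the short-term state, c the long-term cell state
output, h, c = lstm(x)
print(output.shape, h.shape, c.shape)  # (2, 16) (2, 16) (2, 16)
```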

Applications

  • Same as RNNs

Gated Recurrent Units (GRUs)

Phew! LSTMs really took a toll on the time I have left. The GRU is a successor to the LSTM that simplifies its mechanics without jeopardising performance too much. (Learn more: GRUs [1] [2])

Analogy: Fast-food chain v2.0

Not to be a food critic, but the fast-food chain we saw earlier looks pretty inefficient. Is there a way to make it more efficient? Here’s one way.

The new and improved fast-food chain. We no longer have the red person. This will cause fewer delays and help you get your food quicker.
  1. Get rid of the red person (cell state). Now both long-term and short-term memories are managed by the green person (output state).
  2. There’s only an input gate and an output gate (i.e. no forget gate).

You can think of the GRU as an in-between of simple RNNs and LSTMs. Here’s what a GRU looks like.

GRU in the real world. Though it’s not as complex as an LSTM, it can still be quite a bit to swallow. So don’t try to understand how they all connect at this point; just understand the various entities involved. The red dashed ball represents an interim output computed by the GRU cell.
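
Here’s the GRU counterpart of the earlier LSTM sketch, again assuming the Keras API with made-up sizes. Note that return_state=True now hands back only a single state, since the separate cell state (the red person) is gone.

```python
import tensorflow as tf

x = tf.random.normal((2, 5, 8))  # same hypothetical batch shape as before

gru = tf.keras.layers.GRU(16, return_state=True)

# Only one state comes back: the GRU keeps a single (output) state
output, h = gru(x)
print(output.shape, h.shape)  # (2, 16) (2, 16)
```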

Applications

  • Same as RNNs

Conclusion

We looked at simple RNNs, LSTMs and GRUs. Here are the main takeaways.

  • Simple RNNs – A simple model that goes from one time step to the next while generating an output state at every step (no gating mechanism)
  • LSTMs – Quite complicated. Has two states: a cell state (long-term) and an output state (short-term). It also has a gating mechanism to control how much information flows through the model.
  • GRUs – A compromise between RNNs and LSTMs. Has only one output state but still has the gating mechanism (see the quick sketch after this list).
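
One quick way to see that “compromise” in practice is to count parameters. The sketch below (assuming the Keras API, with a made-up input size) builds a SimpleRNN, a GRU and an LSTM of the same size; for the same number of units, the SimpleRNN has the fewest parameters, the LSTM the most, and the GRU sits in between.

```python
import tensorflow as tf

x = tf.random.normal((1, 5, 8))  # one hypothetical sequence, 5 steps, 8 features

for layer_cls in (tf.keras.layers.SimpleRNN,
                  tf.keras.layers.GRU,
                  tf.keras.layers.LSTM):
    layer = layer_cls(16)
    layer(x)  # run once so the layer builds its weights
    print(layer_cls.__name__, layer.count_params())
# Parameter counts grow in this order: SimpleRNN < GRU < LSTM
```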

Next up is the hottest topic in deep learning: Transformers.


If you enjoy the stories I share about Data Science and machine learning, consider becoming a member!

Join Medium with my referral link – Thushan Ganegedara


Want to get better at deep networks and TensorFlow?

Check out my work on the subject.

[1] (Book) TensorFlow 2 in Action – Manning

[2] (Video Course) Machine Translation in Python – DataCamp

[3] (Book) Natural Language Processing in TensorFlow 1 – Packt


New! Join me on my new YouTube channel

If you are keen to see my videos on various machine learning/deep learning topics, make sure to join DeepLearningHero.

Previous articles

Part 1: Feedforward models

