Jason Roell
Towards Data Science
7 min read · Jun 26, 2017


Understanding Recurrent Neural Networks: The Preferred Neural Network for Time-Series Data

Artificial intelligence has been in the background for decades, kicking up dust in the distance, but never quite arriving. Well, that era is over. In 2017, AI has broken through the dust cloud and arrived in a big way. But why? What’s the big deal all of a sudden? And what do recurrent neural networks have to do with it? Well, a lot, actually. Thanks to an ingenious form of short-term memory that is unheard of in conventional neural networks, today’s recurrent neural networks (RNNs) have been proving themselves as powerful predictive engines. When it comes to certain sequential machine learning tasks, such as speech recognition, RNNs are reaching levels of predictive accuracy, time and time again, that no other algorithm can match. However, the first generation of RNNs, back in the day, was not so hot. They suffered from a serious setback in their error-tweaking process that held up their progress for decades. Finally, a major breakthrough came in the late 90s that led to a new generation of far more accurate RNNs. Building on that breakthrough for nearly twenty years, developers refined and perfected these new RNNs until all-star apps such as Google Voice Search and Apple’s Siri started snatching them up to power key processes. Now recurrent networks are showing up everywhere, and they are helping to ignite the AI renaissance that’s unfolding right now.

Neural Networks That Cling to the Past

Most artificial neural networks, such as feedforward neural networks, have no memory of the input they received just one moment ago. For example, if you provide a feedforward neural network with the sequence of letters “WISDOM,” when it gets to “D,” it has already forgotten that it just read “S.” That’s a big problem. No matter how hard you train it, it will always struggle to guess the most likely next character: “O.” This makes it a rather crappy candidate for certain tasks, such as speech recognition, that greatly benefit from the capacity to predict what’s coming next. Recurrent networks, on the other hand, do remember what they’ve just encountered, and at a remarkably sophisticated level.

Let’s take the example of the input “WISDOM” again and apply it to a recurrent network. The unit, or artificial neuron, of the RNN, upon receiving the “D,” also takes as its input the character it received one moment ago, the “S.” In other words, it adds the immediate past to the present. This gives it the advantage of a limited short-term memory that, along with its training, provides enough context for guessing what the next character is most likely to be: “O.”
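
If you like seeing ideas in code, here is a tiny sketch of that memory in Python (my own toy illustration with untrained, random weights, not anything from a real system): the hidden state h is the network’s short-term memory, carried from one letter to the next.

```python
import numpy as np

# Toy sketch: the hidden state h carries context from earlier letters.
np.random.seed(0)
vocab = "WISDOM"
char_to_idx = {c: i for i, c in enumerate(vocab)}
hidden_size = 8
W_x = np.random.randn(hidden_size, len(vocab)) * 0.1   # weights for the current letter
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # weights for the previous state

def one_hot(c):
    v = np.zeros(len(vocab))
    v[char_to_idx[c]] = 1.0
    return v

h = np.zeros(hidden_size)        # the memory starts empty
for c in "WISD":                 # feed the letters one moment at a time
    h = np.tanh(W_x @ one_hot(c) + W_h @ h)

# By the time the network sees "D", h still encodes traces of "W", "I" and "S";
# a trained output layer would use it to rank "O" as the likely next character.
```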

Tweaking and Re-tweaking

If you like to get into the weeds, this is where you get excited. Otherwise, get ready for a rough patch. But hang in there, it’s worth it. Like all artificial neural networks, the units of an RNN assign a matrix of weights to their multiple inputs, then apply a function to those weighted inputs to determine a single output. However, recurrent networks apply weights not only to their present inputs, but also to their inputs from a moment ago. Then they adjust the weights assigned to their present and past inputs through a process that involves two key concepts that you’ll definitely want to know if you really want to get into AI: gradient descent and backpropagation through time (BPTT).
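
Here is roughly what one of those steps looks like, written out with the standard “vanilla RNN” equations (the weight names and sizes are my own placeholder choices, not a trained model): one set of weights acts on the present input, another on the state from a moment ago, and an output function turns the result into a prediction.

```python
import numpy as np

# One full RNN step: weight the present, weight the past, squash, then predict.
np.random.seed(1)
input_size, hidden_size, output_size = 6, 16, 6
W_x = np.random.randn(hidden_size, input_size) * 0.1   # weights on the present input
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # weights on the input from a moment ago
W_y = np.random.randn(output_size, hidden_size) * 0.1  # weights that shape the output
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)  # blend present and past, then squash
    y_t = softmax(W_y @ h_t + b_y)                 # probabilities for the next character
    return h_t, y_t
```

Training is what adjusts W_x, W_h, W_y and the biases, and that is exactly where gradient descent and backpropagation through time come in.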

Gradient Descent

One of the most famous algorithms in machine learning is known as gradient descent. Its primary virtue is its remarkable capacity to sidestep the dreaded “curse of dimensionality.” This issue plagues systems, such as neural networks, with far too many variables to make a brute-force calculation of their optimal values possible. Gradient descent, however, breaks the curse of dimensionality by zooming in on the local low-point, or local minimum, of the multi-dimensional error or cost function. This helps the system determine the tweaked value, or weight, to assign to each of the units in the network, bringing accuracy back in line.
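
To make that concrete, here is a toy, one-weight version of gradient descent (my own illustration, not tied to any real network): instead of brute-forcing every possible value, we follow the slope of the error downhill.

```python
# Toy gradient descent with a single weight: the error is lowest when w == 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)       # the derivative, i.e. the slope of the error

w = 10.0                         # start with a bad weight
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * grad(w) # step downhill, proportional to the slope

print(w)                         # ~3.0: the value gradient descent settled on
```

A real network does the same thing with millions of weights at once, using the gradient of the cost function rather than a single derivative.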

Backpropagation Through Time

The RNN trains its units by adjusting their weights following a slight modification of a feedback process known as backpropagation. Okay, this is a weird concept. But if you’re into AI, you’ll learn to love it. The process of backpropagation works its way back, layer by layer, from the network’s final output, tweaking the weights of each unit, or artificial neuron, according to the unit’s calculated portion of the total output error. Got it? If so, get ready for one more layer of complexity. Recurrent neural networks use a heavier version of this process known as backpropagation through time (BPTT). This version extends the tweaking process to include the weights applied to the T-1 input values, the ones responsible for each unit’s memory of the prior moment.
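
Here is a stripped-down sketch of BPTT on a scalar “network” (my own toy example, with a loss only at the final step): run forward over the whole sequence, then walk backwards through time, giving the recurrent weight credit for every moment it influenced.

```python
import numpy as np

# Tiny BPTT sketch: scalar weights for readability, loss at the last step only.
np.random.seed(2)
x = np.array([0.5, -0.1, 0.3, 0.8])   # a short input sequence
target = 0.7                           # desired final output
w_x, w_h = 0.4, 0.9                    # weight on present input, weight on past state

# Forward pass: build up the hidden states, remembering each one.
hs = [0.0]                             # h_0 starts empty
for x_t in x:
    hs.append(np.tanh(w_x * x_t + w_h * hs[-1]))

loss = (hs[-1] - target) ** 2

# Backward pass through time: accumulate gradients over every time step.
dw_x, dw_h = 0.0, 0.0
dh = 2.0 * (hs[-1] - target)           # gradient of the loss w.r.t. the last state
for t in reversed(range(len(x))):
    dpre = dh * (1.0 - hs[t + 1] ** 2) # backprop through tanh
    dw_x += dpre * x[t]                # credit for the present input at step t
    dw_h += dpre * hs[t]               # credit for the state from a moment ago (the T-1 input)
    dh = dpre * w_h                    # pass the gradient one step further back in time

print(loss, dw_x, dw_h)
```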

Yikes: The Vanishing Gradient Problem

Despite enjoying some initial success with the help of gradient descent and BPTT, many artificial neural networks, including the first generation of RNNs, eventually ran out of gas. Technically, they suffered a serious setback known as the vanishing gradient problem. Although the details fall way outside the scope of this sweeping overview, the basic idea is pretty straightforward. First, let’s look at the notion of a gradient. Like its simpler relative, the derivative, you can think of a gradient as a slope. In the context of training a deep neural network, the larger the gradient, the steeper the slope, the more quickly the system can roll downhill to the finish line and complete its training. But this is where developers ran into trouble — their slopes were too flat for fast training. This was particularly problematic in the first layers of their deep networks, which are the most critical when it comes to proper tweaking of memory units. Here the gradient values got so small, and their corresponding slopes so flat, that one could describe them as “vanishing,” thus the vanishing gradient problem. As the gradients got smaller and smaller, and thus flatter and flatter, the training times grew unbearably long. It was an error-correction nightmare without end.
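
A quick numerical illustration of why this happens (my own toy numbers, not a real network): the gradient an early time step receives is a product of per-step factors, and when each factor sits below one, the product collapses toward zero as the sequence gets longer.

```python
import numpy as np

# Toy demo: repeated multiplication by factors below 1 makes the gradient vanish.
np.random.seed(3)
w_h = 0.5                              # recurrent weight
h = np.random.uniform(-0.9, 0.9, 100)  # pretend hidden-state values over 100 steps

grad_signal = 1.0
for t, h_t in enumerate(h, start=1):
    grad_signal *= w_h * (1.0 - h_t ** 2)   # tanh derivative times recurrent weight
    if t in (1, 10, 50, 100):
        print(t, grad_signal)

# The printed values shrink by many orders of magnitude: the "slope" that the
# earliest steps see is nearly flat, so they learn painfully slowly.
```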

The Big Breakthrough: Long Short-Term Memory

Finally, in the late 90s, a major breakthrough solved the vanishing gradient problem and gave a second wind to recurrent network development. At the center of this new approach were units of long short-term memory (LSTM).

As weird as that sounds, the long and short of it is that LSTMs made a world of difference in the field of AI. These new units, or artificial neurons, like the standard short-term memory units of RNNs, remember their inputs from a moment ago. However, unlike standard RNN units, LSTMs can hang on to their memories, which have read/write properties akin to memory registers in a conventional computer. Yet LSTMs have analog, rather than digital, memory, making their functions differentiable. In other words, their curves are continuous and you can find the steepness of their slopes. So they are a good fit for the partial differential calculus involved in backpropagation and gradient descent.
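
Here is a compact sketch of a single LSTM step using the standard gate equations (random placeholder weights, not a trained model): the gates decide what the cell keeps, forgets and reveals, and every piece of the machinery is smooth and differentiable.

```python
import numpy as np

# One LSTM step: forget, input and output gates acting on an analog cell state.
np.random.seed(4)
input_size, hidden_size = 6, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate, each acting on [h_prev, x_t] combined.
W_f, W_i, W_c, W_o = (np.random.randn(hidden_size, hidden_size + input_size) * 0.1
                      for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(hidden_size) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)        # forget gate: what to erase from memory
    i = sigmoid(W_i @ z + b_i)        # input gate: what new information to store
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate values to write
    c = f * c_prev + i * c_tilde      # updated cell state (the long-term memory)
    o = sigmoid(W_o @ z + b_o)        # output gate: what to reveal
    h = o * np.tanh(c)                # new hidden state
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(np.random.randn(input_size), h, c)
```

The key line is c = f * c_prev + i * c_tilde: because the cell state is carried forward largely by addition, error signals can travel back through many time steps without flattening out, which is exactly what keeps the gradients from vanishing.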

Altogether, LSTMs can not only tweak their weights but also retain, delete, transform and otherwise control the inflow and outflow of their stored data according to the quirks of their training. Most importantly, LSTMs can cling to important error information for long enough to keep gradients relatively steep and thus training periods relatively short. This wipes out the vanishing gradient problem and greatly improves the accuracy of today’s LSTM-based recurrent networks. Thanks to this remarkable improvement in the RNN architecture, Google, Apple and many other leading companies, not to mention startups, are now using RNNs to power applications at the center of their businesses. In short, RNNs are suddenly a big deal.

What to Remember about RNNs

Let’s recap the highlights of these amazing memory machines. Recurrent neural networks, or RNNs, can remember their former inputs, which gives them a big edge over other artificial neural networks when it comes to sequential, context-sensitive tasks such as speech recognition. However, the first generation of RNNs hit the wall when it came to its capacity to correct for errors through the all-important twin processes of backpropagation and gradient descent. Known as the dreaded vanishing gradient problem, this stumbling block virtually halted progress in the field until 1997, when a major breakthrough introduced a vastly improved LSTM-based architecture to the field. The new approach, which effectively turned each unit in a recurrent network into an analog computer, greatly increased accuracy and helped lead to the renaissance in AI we’re seeing all around us today.

If you have enjoyed this post, the biggest compliment you could give would be to share this with someone that you think would enjoy it!

Additionally, if you never want to miss a post, subscribe to my articles by clicking the green heart button below! Thanks for reading, have a great day, and never stop learning!
