
Neural networks are ubiquitous thanks to their ability to capture non-linear relationships in data. This article aims to explain popular neural network structures succinctly and in simple terms, providing an intuition for how they work and, more importantly, for why their structures might be useful for different problems. Each section includes links to resources with in-depth explanations of the topics.
Contents
- Artificial Neural Network (ANN)
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Transformer Networks and Attention Mechanisms
- Concluding Remarks
Artificial Neural Network (ANN)
A neural network is a function that takes an input tensor X (with i rows and j columns) and maps it to an output y_hat in an attempt to estimate a true value y. This function has unknown parameters θ that affect the mapping. The optimal θ are those which minimise the error between y and y_hat.
The network itself consists of an input layer, hidden layers and an output layer; a simple network with one hidden layer is shown in Figure 1 below.

Each layer's output acts as the input to the subsequent layer. Layer L has i nodes, whose values are represented by the vector a_i^(L). This is a function of the output nodes of the previous layer, a_j^(L-1), transformed by a non-linear activation function ϕ^(L), as defined in the equation below:

a_i^(L) = ϕ^(L)( W_ij^(L) a_j^(L-1) )
W_ij^(L) is the weights matrix: a set of multiplicative factors that linearly combine the output nodes of the previous layer.
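To make the notation concrete, here is a minimal NumPy sketch of a single layer's forward pass; the layer sizes and the choice of a sigmoid for ϕ are illustrative assumptions, not details from the figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # one possible choice of activation ϕ

rng = np.random.default_rng(0)
a_prev = rng.random(4)         # a_j^(L-1): previous layer has 4 nodes (assumed)
W = rng.normal(size=(3, 4))    # W_ij^(L): randomly initialised weights

a = sigmoid(W @ a_prev)        # a_i^(L) = ϕ^(L)( W_ij^(L) a_j^(L-1) )
```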
In the simple network shown in Figure 1, the predicted output is given by:

y_hat = a_1^(2) = ϕ^(2)( W_1j^(2) a_j^(1) )

where:

a_i^(1) = ϕ^(1)( W_ij^(1) x_j )
In this case, the unknown parameters of the model that need to be estimated are the two weights matrices, W_ij^(1) and W_1j^(2). These are determined by minimising the error between the prediction and the true value, J = Error(y_hat, y). The gradient of J with respect to each parameter θ, ∇_θ J, is used to nudge the values of θ closer to the optimum. The new values, θ_new = θ − η ∇_θ J (where η is a step size), are then used to compute a new y_hat, and the process is repeated for a number of iterations. This process of propagating gradients back through the network is known as back-propagation, and is what is often referred to as learning.
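To illustrate the full loop, below is a minimal NumPy sketch of a one-hidden-layer network trained by gradient descent. The toy data, layer sizes, sigmoid hidden activation, linear output and squared-error loss are all assumptions chosen for brevity, not details taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples with 3 features, scalar targets (assumed)
X = rng.normal(size=(100, 3))
y = rng.normal(size=(100, 1))

# Unknown parameters θ: the two weights matrices
W1 = rng.normal(size=(3, 5))   # input -> hidden
W2 = rng.normal(size=(5, 1))   # hidden -> output

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 0.1                      # step size η

for _ in range(1000):
    # Forward pass
    a1 = sigmoid(X @ W1)               # hidden layer activations
    y_hat = a1 @ W2                    # linear output layer (assumed)
    J = np.mean((y_hat - y) ** 2)      # error J = Error(y_hat, y)

    # Back-propagation: gradients of J w.r.t. each weights matrix
    d_yhat = 2 * (y_hat - y) / len(X)
    grad_W2 = a1.T @ d_yhat
    d_a1 = d_yhat @ W2.T * a1 * (1 - a1)   # sigmoid derivative
    grad_W1 = X.T @ d_a1

    # Gradient-descent update: θ ← θ − η ∇θ J
    W1 -= eta * grad_W1
    W2 -= eta * grad_W2
```

Each iteration performs a forward pass, computes the gradients via the chain rule (back-propagation), and nudges the weights in the direction that reduces J.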
A good resource that delves further into the mathematical aspects of neural networks is Ian Goodfellow’s book. 3Blue1Brown’s videos, on the other hand, provide a nice visual intuition for how these networks work.
Convolutional Neural Network (CNN)
Although feedforward neural networks are very useful, they struggle to extract spatial features from high-dimensional data. Convolutional Neural Networks (CNNs) have been designed to do precisely this. For an intuitive example, consider image classification of hand-written letters. Each letter is a picture represented by pixels, as shown in Figure 2 below. However, since the letters are hand-written, the same letter may be distorted, shifted, shrunk or rotated.

As humans, we can recognise the spatial features in each image and see that both represent the letter X. However, a normal feedforward network would treat each pixel as an independent datapoint, and thus see the two letters in Figure 2 as vastly different. A CNN rectifies this by extracting spatial features. Figure 3 shows the features from both images in Figure 2 that are representative of the letter X.

These features are extracted by means of a sliding window, called a kernel, that is initialised with random weights. At each window of the image, the pixel values are convolved with the weights of the kernel (i.e. an element-wise product is taken and summed). A process known as pooling is then applied, where the maximum value in each window is kept, effectively retaining the most important features. After the convolutional layers, standard feedforward layers are added to enable classification. The entire process is shown in Figure 4 below.
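As a rough sketch of these two operations, the snippet below convolves a toy greyscale image with a random 3×3 kernel and then applies 2×2 max pooling; the image size, kernel size, stride and pooling window are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))         # toy greyscale image (assumed size)
kernel = rng.normal(size=(3, 3))   # sliding window with random weights

# Convolution: sum of the element-wise product of the kernel with each 3x3 window (stride 1)
conv = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        conv[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

# Max pooling: keep the largest value in each 2x2 window
pooled = conv.reshape(3, 2, 3, 2).max(axis=(1, 3))
print(pooled.shape)  # (3, 3)
```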

The image example can be extended to other fields as well, for instance in Natural Language Processing (NLP), where instead of pixels, you have words and their embedding dimensions, as shown in Figure 5.

In this case, the kernel slides in a single direction, extracting features from the context window around each word. This idea can be extended to larger segments of text, e.g. sentences, where the representation becomes 3-dimensional. One key difference when dealing with text data is that sentences can vary in length, so a maximum sentence length is typically specified.
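In code, a piece of text is simply a (sentence length × embedding dimension) matrix, and the kernel spans the full embedding dimension while sliding over the words only. A minimal sketch, assuming a window of 3 words and 8-dimensional embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, emb_dim, window = 10, 8, 3            # assumed sizes

sentence = rng.normal(size=(max_len, emb_dim))  # padded/truncated to max_len
kernel = rng.normal(size=(window, emb_dim))     # spans the full embedding dim

# Slide the kernel along the word axis only
features = np.array([
    np.sum(sentence[t:t+window] * kernel)
    for t in range(max_len - window + 1)
])
print(features.shape)  # (8,) — one feature per context window
```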
Alexander Amini’s video does a great job of explaining CNNs at a high level. For those interested in mathematical aspects, Ian Goodfellow’s book is recommended once again.
Recurrent Neural Network (RNN)
Unlike images, text contains both spatial and sequential information. As a result, neither feedforward networks nor CNNs provide a natural way of reading text data. Recurrent Neural Networks (RNNs) are structures with feedback loops that enable a sequential reading of data.

The output state at each time step, h_t, is thus given as a function of the input x_t at that time step and the output state of the previous cell. Therefore:

h_t = f( x_t, h_(t-1) )
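A minimal NumPy sketch of this recurrence, assuming a tanh non-linearity and randomly initialised weights (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 4, 5        # assumed sizes

xs = rng.normal(size=(seq_len, input_dim))      # input sequence x_1 .. x_T
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))

h = np.zeros(hidden_dim)                        # initial state
for x_t in xs:
    # h_t = f(x_t, h_(t-1)); here f is a tanh of a linear combination
    h = np.tanh(W_xh @ x_t + W_hh @ h)
```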

Further, text data has long term dependencies that need to be accounted for. For example, consider the task of predicting the next word in the following sentence:
I grew up in France, but I now live in Boston. I speak fluent …
In this case, an RNN would have to use the trigger word France to predict the next word, French. However, since the words are far apart, the impact of France vanishes during training. As a result, cell structures that store memory, called Gated Cells, are required. One common gated cell is the Long Short-Term Memory (LSTM) cell. A schematic is shown in Figure 7.

The important thing to note is that the LSTM cell has four effects: it forgets irrelevant information from previous states, stores new information, updates the cell state separately, and controls the output information. This regulates the information that flows through the cell, which helps it capture long-term dependencies.
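These four effects map directly onto the gates of the cell. The sketch below is a bare-bones LSTM step; concatenating the previous state with the current input and omitting bias terms are simplifications made here for brevity, not details from the figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o):
    z = np.concatenate([h_prev, x_t])   # previous state + current input
    f = sigmoid(W_f @ z)                # forget gate: drop irrelevant info
    i = sigmoid(W_i @ z)                # input gate: store new info
    c_tilde = np.tanh(W_c @ z)          # candidate cell values
    c = f * c_prev + i * c_tilde        # update the cell state separately
    o = sigmoid(W_o @ z)                # output gate: control what flows out
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
H, D = 5, 4                             # hidden and input sizes (assumed)
W = [rng.normal(size=(H, H + D)) for _ in range(4)]
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), *W)
```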
Another consideration in text processing is that information flows two ways. For example, consider now the future context of our example sentence:
I grew up in France, but I now live in Boston. I speak fluent … because I lived in Frankfurt for 5 years.
A uni-directional RNN will only capture the past information, and thus likely predict French. However, when the future context is considered, German would be more appropriate. Therefore, bidirectional RNN models, for example bi-LSTMs, have also been developed and shown to perform better than uni-directional models.
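A bidirectional model can be sketched as two independent recurrences, one reading the sequence forwards and one backwards, with their states concatenated; the simple tanh cell below is reused purely for illustration:

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 4))                          # a 6-step toy sequence
Wf_xh, Wf_hh = rng.normal(size=(5, 4)), rng.normal(size=(5, 5))
Wb_xh, Wb_hh = rng.normal(size=(5, 4)), rng.normal(size=(5, 5))

forward = run_rnn(xs, Wf_xh, Wf_hh)                   # past context
backward = run_rnn(xs[::-1], Wb_xh, Wb_hh)[::-1]      # future context
bi_states = np.concatenate([forward, backward], axis=1)  # (6, 10)
```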
It’s worth mentioning that, although RNNs provide a much more natural reading of text, they are typically very difficult to train, and thus CNNs, which rely only on spatial features, often perform better on a multitude of tasks.
Ava Soleimany explains RNNs, and specifically LSTMs, in an intuitive way here. For a detailed guide on LSTMs specifically, the reader is referred to Chris Olah’s blog.
Transformer Networks and Attention Mechanisms
A key limitation of RNN models is that information gets lost in very long sequences. Although an LSTM is able to retain some memory, the structure itself is sequential, and therefore the output is not directly a function of every sequence element. Attention Mechanisms solve this problem by ensuring that each item in the sequence directly affects the output. A high-level schematic of this is given in Figure 8 below.

This has the empirical effect of mimicking human attention, where the machine attends to parts of the sequence that are most informative, as shown in Figure 9.

Transformer Networks are models that rely on attention mechanisms and have no recurrence. Their structure enables independent calculations, making GPU parallelisation easy. Transformers are the state-of-the-art (SOTA) models for NLP tasks. Interested readers are referred to Peter Bloem’s blog and Jay Alammar’s [blog](https://jalammar.github.io/illustrated-transformer/).
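To make the mechanism concrete, here is a minimal NumPy sketch of the scaled dot-product self-attention that Transformers build on; the projection matrices and sizes are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                       # assumed sizes

X = rng.normal(size=(seq_len, d_model))        # one embedding per sequence item
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)            # how much each item attends to every other
weights = softmax(scores, axis=-1)             # each row sums to 1
output = weights @ V                           # every item directly influences every output
```

Because the score matrix is computed in one shot rather than step by step, the whole sequence can be processed in parallel, which is what makes these models so GPU-friendly.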
Concluding Remarks
Neural networks are structures that will become increasingly important in Data Science as GPUs continue to improve. Keeping up with the literature on them is nearly impossible, but it’s always useful to have a high-level idea of how they work, as a starting point for further research.
I will be updating this article with more information on other structures (such as Graph NNs, Hierarchical NNs, etc…), so stay tuned for more!
All images by author except where indicated otherwise.