Introduction to Neural Networks

A detailed overview of neural networks with a wealth of examples and simple imagery.

Matthew Stewart, PhD
Towards Data Science


“Your brain does not manufacture thoughts. Your thoughts shape neural networks.” — Deepak Chopra

This article is the first in a series of articles aimed at demystifying the theory behind neural networks and how to design and implement them. The article was designed to be a detailed and comprehensive introduction to neural networks that is accessible to a wide range of individuals: people who have little to no understanding of how a neural network works as well as those who are relatively well-versed in their uses, but perhaps not experts. In this article, I will cover the motivation and basics of neural networks. Future articles will go into more detailed topics about the design and optimization of neural networks and deep learning.

These tutorials are largely based on the notes and examples from multiple classes taught at Harvard and Stanford in the computer science and data science departments.

I hope you enjoy the article and learn something regardless of your prior understanding of neural networks. Let’s begin!

The Motivation for Neural Networks

Untrained neural network models are much like newborn babies: they are created ignorant of the world (if we adopt a tabula rasa epistemology), and it is only through exposure to the world, i.e. a posteriori knowledge, that their ignorance is slowly revised. Algorithms experience the world through data; by training a neural network on a relevant dataset, we seek to decrease its ignorance. The way we measure progress is by monitoring the error produced by the network.

Before delving into the world of neural networks, it is important to get an understanding of the motivation behind these networks and why they work. To do this, we have to talk a bit about logistic regression.

Methods centered around modeling and predicting a quantitative response variable (e.g. the number of taxi pickups or bike rentals) are called regression methods (linear regression, ridge, LASSO, etc.). When the response variable is categorical, the problem is no longer called a regression problem but is instead labeled a classification problem.

Let us consider a binary classification problem. The goal is to attempt to classify each observation into a category (such as a class or cluster) defined by Y, based on a set of predictor variables X.

Let’s say that we would like to predict whether a patient has heart disease based on features about the patient. The response variable here is categorical (there are finitely many outcomes) and, more explicitly, binary, since there are only two categories (yes/no).

There are a lot of features here; for now, we will only use the MaxHR variable.

To make this prediction, we would use a method known as logistic regression. Logistic regression addresses the problem of estimating a probability that someone has heart disease, P(y=1), given an input value X.

The logistic regression model uses a function, called the logistic function, to model P(y=1):
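For a single predictor X, the logistic function takes the familiar form

P(y=1) = 1 / (1 + e^(−(β₀ + β₁X)))

where β₀ and β₁ are the regression parameters we need to estimate.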

As a result, the model will predict P(y=1) with an S-shaped curve, which is the general shape of the logistic function.

β₀ shifts the curve right or left by c = − β₀ / β₁, whereas β₁ controls the steepness of the S-shaped curve.

Note that if β₁ is positive, then the predicted P(y=1) goes from zero for small values of X to one for large values of X, and if β₁ is negative, the association is reversed.

This is summarized graphically below.

Now that we understand how to manipulate our logistic regression curve, we can play with some of the variables in order to get the sort of curve that we want.

We can change the β₀ value to shift the curve left or right (move the offset).

We can change the β₁ value to adjust the steepness of the curve.

Doing this by hand is pretty tedious, and it is unlikely you will converge to the optimal values. To solve this problem, we use a loss function to quantify the level of error that belongs to our current parameters. We then find the coefficients that minimize this loss function. For this binary classification, we can use the binary cross-entropy (log) loss to optimize our logistic regression.
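For n observations with predicted probabilities pᵢ = P(yᵢ=1), this loss is

ℒ = −(1/n) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ],

which is small when the predicted probabilities agree with the true labels and large when they do not.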

So the parameters of the neural network have a relationship with the error the net produces, and when the parameters change, the error does, too. We change the parameters using an optimization algorithm called gradient descent, which is useful for finding the minimum of a function. We are seeking to minimize the error, which is also known as the loss function or the objective function.

So what is the point of what we just did? How does this relate to neural networks? Actually, what we just did is essentially the same procedure that is performed by neural network algorithms.

We only used one feature for our previous model. Instead, we can take multiple features and illustrate these in a network format. We have weights for each of the features and we also have a bias term, which together makes up our regression parameters. Depending on whether the problem is a classification or regression problem, the formulation will be slightly different.

When we talk about weights in neural networks, it is these regression parameters of our various incoming functions that we are discussing. This is then passed to an activation function, which decides whether the result is significant enough to ‘fire’ the node. I will discuss different activation functions in more detail in the next article.

So now we have developed a very simple network: a multiple logistic regression with four features.
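As a minimal sketch of this network in code (the feature values and initial weights below are made up purely for illustration), the forward pass is just a logistic regression over the inputs:

```python
import numpy as np

def sigmoid(z):
    # The logistic (sigmoid) function.
    return 1.0 / (1.0 + np.exp(-z))

# Four illustrative patient features (e.g. MaxHR and three others).
x = np.array([150.0, 1.0, 240.0, 0.5])

# Randomly initialized weights and bias: the regression parameters
# that training will later optimize.
w = 0.01 * np.random.randn(4)
b = 0.0

# Forward pass: affine transformation, then the activation.
p = sigmoid(np.dot(w, x) + b)
print(p)  # predicted P(y=1)
```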

We need to start with an arbitrary initialization of the values in order to start updating and optimizing the parameters, which we will do by assessing the loss function after each update and performing gradient descent.

The first thing we do is set randomly selected weights. Most likely it will perform horribly — in our heart data, the model will give us the wrong answer.

We then ‘train’ the network by essentially punishing it for performing poorly.

However, merely telling the computer it is performing well or badly is not particularly helpful. You need to tell it how to change the weights so that the model’s performance improves.

We already know how to tell the computer it is performing well: we just need to consult our loss function. Now, the procedure is more complicated because we have five weights to deal with (one for each of the four features, plus the bias). I will only consider one weight, but the procedure is analogous for all of them.

Ideally, we want to know the value of w that gives the minimum of ℒ(w).

To find the optimal point of a function ℒ(w), we can differentiate it with respect to the weight and set the derivative equal to zero.

We then need to find the w that satisfies that equation. Sometimes there is no explicit (closed-form) solution for it.

A more flexible method is to start from any point and then determine which direction to go to reduce the loss (left or right in this case). Specifically, we can calculate the slope of the function at this point. We then shift to the right if the slope is negative or shift to the left if the slope is positive. This procedure is then repeated until convergence.

If the step is proportional to the slope, then you avoid overshooting the minimum.

How do we perform this update? This is done using a method known as gradient descent, which was briefly mentioned earlier.

Gradient Descent

Gradient descent is an iterative method for finding the minimum of a function. There are various flavors of gradient descent, each offering a different way of updating the weights, and I will discuss these in detail in the subsequent article. For now, we will stick with the vanilla gradient descent algorithm, sometimes known as the delta rule.

We know that we want to go in the opposite direction of the derivative (since we are trying to move away from the error), and we know we want to make a step proportional to the derivative. This step is controlled by a parameter λ known as the learning rate. Our new weight is the old weight plus this step, where the step is derived from the loss function: the derivative tells us how strongly the relevant parameter influences the loss.
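In symbols, the update rule is

w_new = w_old − λ · (∂ℒ/∂w),

where the minus sign moves us against the slope and λ scales the size of the step.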

A large learning rate means more weight is put on the derivative, such that large steps can be made for each iteration of the algorithm. A smaller learning rate means that less weight is put on the derivative, so smaller steps can be made for each iteration.

If the step size is too small, the algorithm will take a long time to converge, and if the step size is too large, the algorithm will continually miss the optimal parameter choice. Clearly, the learning rate is an important parameter to get right when setting up a neural network.
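As a sketch of the procedure, here is vanilla gradient descent on a simple one-dimensional loss (the quadratic loss and learning rate are illustrative choices, not our heart-disease model):

```python
def loss(w):
    # A simple convex loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def dloss_dw(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0    # arbitrary starting point
lam = 0.1  # learning rate (lambda)

for step in range(100):
    # Step against the slope, proportionally to its size.
    w = w - lam * dloss_dw(w)

print(w)  # converges towards 3.0
```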

There are various considerations to make for gradient descent:

  • We still need to derive the derivatives.
  • We need to know what the learning rate is or how to set it.
  • We need to avoid local minima.
  • Finally, the full loss function is a sum over the ‘errors’ of all individual observations, of which there can be hundreds of thousands.

Deriving the derivatives is nowadays done using automatic differentiation, so this is of little concern to us. However, deciding the learning rate is an important and complicated problem, which I will discuss later in the set of tutorials.

Local minima can be very problematic for neural networks, since the formulation of neural networks gives no guarantee that we will attain the global minimum.


Getting stuck in a local minimum means we have a locally good optimization of our parameters, but there is a better optimization somewhere on our loss surface. Neural network loss surfaces can have many of these local optima, which is problematic for network optimization. See, for example, the loss surface illustrated below.

Example neural network loss surface.
Network getting stuck in a local minimum.
Network reaching the global minimum.

How might we solve this problem? One suggestion is the use of batch and stochastic gradient descent. This sounds complicated, but the idea is simple: use a batch (a subset) of the data as opposed to the whole dataset, such that the loss surface is partially morphed during each iteration.

For each iteration k, the following loss (likelihood) function can be used to derive the derivatives:
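ℒₖ(w) = Σᵢ ℒᵢ(w), with the sum running only over the observations i in the k-th batch,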

which is an approximation to the full loss function. We can illustrate this with an example. First, we start off with the full loss (likelihood) surface, and our randomly assigned network weights provide us an initial value.

We then select a batch of data, perhaps 10% of the full dataset, and construct a new loss surface.

We then perform gradient descent on this batch and perform our update.

We are now in a new location. We select a new random subset of the full data set and again construct our loss surface.

We then perform gradient descent on this batch and perform our update.

We continue this procedure again with a new subset.

And perform our update.

This procedure continues for multiple iterations.

Until the network begins to converge to the global minimum.
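A minimal sketch of this batched procedure (the synthetic data, batch size, and gradient formula below are illustrative assumptions, using the standard logistic-regression gradient):

```python
import numpy as np

def batch_gradient(w, X_batch, y_batch):
    # Gradient of the binary cross-entropy loss for a logistic
    # regression, computed on one batch only.
    p = 1.0 / (1.0 + np.exp(-(X_batch @ w)))
    return X_batch.T @ (p - y_batch) / len(y_batch)

# Illustrative dataset: 1,000 observations with 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(4)    # arbitrary initial weights
lam = 0.1          # learning rate
batch_size = 100   # roughly 10% of the data per iteration

for k in range(200):
    # Select a new random subset and update on its loss surface.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    w -= lam * batch_gradient(w, X[idx], y[idx])

print(w)
```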

We now have sufficient knowledge in our tool kit to go about building our first neural network.

Artificial Neural Network (ANN)

Now that we understand how logistic regression works, how we can assess the performance of our network, and how we can update the network to improve our performance, we can go about building a neural network.

First, I want us to understand why neural networks are called neural networks. You have probably heard that it is because they mimic the structure of neurons, the cells present in the brain. The structure of a neuron looks a lot more complicated than a neural network, but the functioning is similar.


The way an actual neuron works involves the accumulation of electric potential which, when it exceeds a particular threshold value, causes the pre-synaptic neuron to discharge across the axon and stimulate the post-synaptic neuron.

Humans have billions of neurons which are interconnected and can produce incredibly complex firing patterns. The capabilities of the human brain are incredible compared to what we can do even with state-of-the-art neural networks. Due to this, we will likely not see neural networks mimicking the function of the human brain anytime soon.

We can draw a neural diagram that makes the analogy between the neuron structure and the artificial neurons in a neural network.


Given the capabilities of the human brain, it should be apparent that the capabilities of artificial neural networks are fairly limitless in scope — especially once we begin to link these to sensors, actuators, as well as the wealth of the internet — which explains their ubiquity in the world despite the fact we are in the relatively nascent phases of their development.

After all, a reductionist could argue that humans are merely an aggregation of neural networks connected to sensors and actuators through the various parts of the nervous system.

Now let’s imagine that we have multiple features. Each of the features is passed through something called an affine transformation, which is basically an addition (or subtraction) and/or multiplication. This gives us something resembling a regression equation. The affine transformation becomes important when we have multiple nodes converging at a node in a multilayer perceptron.

We then pass this result through our activation function, which gives us some form of probability. This probability determines whether the neuron will fire — our result can then be plugged into our loss function in order to assess the performance of the algorithm.

From now on, I will abstract the affine and activation blocks into a single block. However, be clear that the affine transformation amalgamates the outputs from upstream nodes, and the summed output is then passed to an activation function, which assesses whether this quantitative value (the probability) is sufficient to make the neuron fire.

We can now go back to our first example with our heart disease data. We can take two logistic regressions and merge them together. The individual logistic regressions look like the below case:

When we connect these two networks, we obtain a network with increased flexibility due to the increased number of degrees of freedom.

This illustrates the power of neural networks quite well: we are able to string together (sum) multiple functions such that, with a large number of functions coming from a large number of neurons, we are able to produce highly non-linear functions. With a large enough set of neurons, continuous functions of arbitrary complexity can be produced.
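As a small sketch of this merging (the weights below are hand-picked purely for illustration), summing two sigmoid ‘neurons’ already produces a bump-shaped curve that no single logistic regression could fit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-10, 10, 200)

# Two individual logistic regressions with different offsets and slopes.
h1 = sigmoid(2.0 * x + 6.0)    # rises around x = -3
h2 = sigmoid(-1.5 * x + 4.0)   # falls around x = 2.7

# Merging them: a weighted sum passed through an output activation.
# The result is high between the two transition points and low outside,
# i.e. a non-linear bump that a single sigmoid cannot produce.
y = sigmoid(3.0 * h1 + 3.0 * h2 - 4.0)
print(y[0], y.max(), y[-1])  # low at the edges, high in the middle
```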

This is a very simple example of a neural network; however, we see that we already run into a problem even with such a simple network. How are we supposed to update the value of our weights?

We need to be able to calculate the derivatives of the loss function with respect to these weights. In order to learn the missing weights, w₁, w₂, and w₃, we need to utilize something known as backpropagation.

Backpropagation

Backpropagation is the central mechanism by which neural networks learn. It is the messenger telling the network whether or not the network made a mistake during prediction. The discovery of backpropagation is one of the most important milestones in the whole of neural network research.

To propagate is to transmit something (e.g. light, sound) in a particular direction or through a particular medium. When we discuss backpropagation in the context of neural networks, we are talking about the transmission of information, and that information relates to the error produced by the neural network when it makes a guess about data.

During prediction, a neural network propagates a signal forward through its nodes until it reaches the output layer, where a decision is made. The network then backpropagates information about the error of this decision backward through the network, so that it can alter each of the parameters.


Backpropagation is the way in which we calculate the derivatives for each of the parameters in the network, which is necessary in order to perform gradient descent. This is an important distinction to make as it can be easy to mix up backpropagation and gradient descent. Backpropagation is performed first in order to gain the information necessary to perform gradient descent.
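To make this distinction concrete, here is a hedged sketch of backpropagation by hand for a single sigmoid neuron with the cross-entropy loss (one illustrative data point and made-up parameter values): backpropagation produces the derivatives, and only the last two lines are gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One illustrative observation with two features, and its true label.
x = np.array([1.0, 2.0])
y = 1.0

w = np.array([0.5, -0.5])  # current weights
b = 0.0                    # bias term
lam = 0.1                  # learning rate

# Forward pass.
z = np.dot(w, x) + b                               # affine transformation
p = sigmoid(z)                                     # predicted P(y=1)
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # cross-entropy loss

# Backpropagation: the chain rule, from the loss back to each parameter.
dloss_dz = p - y         # derivative of the loss w.r.t. z (sigmoid + cross-entropy)
dloss_dw = dloss_dz * x  # dloss/dw = dloss/dz * dz/dw, where dz/dw = x
dloss_db = dloss_dz      # dz/db = 1

# Gradient descent then uses the derivatives that backpropagation produced.
w = w - lam * dloss_dw
b = b - lam * dloss_db
print(loss, w, b)
```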

You might have noticed that we still need to calculate the derivatives. Computers cannot differentiate expressions on their own, but a function library can be built to do this without the network designer needing to get involved; it abstracts the process for us. This is known as automatic differentiation. Below is an example of this.

We could do it by hand like this, and then change it for every network architecture and for each node.

Or we can write a function library that is inherently linked to the architecture such that the procedure is abstracted and updates automatically as the network architecture is updated.
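As a sketch of what such a library provides (here using JAX as one example of an automatic differentiation library; the loss function is the illustrative one from above), we get the derivatives without ever writing out the chain rule ourselves:

```python
import jax.numpy as jnp
from jax import grad

def loss(w, x, y):
    # Cross-entropy loss for a single sigmoid neuron.
    p = 1.0 / (1.0 + jnp.exp(-jnp.dot(w, x)))
    return -(y * jnp.log(p) + (1 - y) * jnp.log(1 - p))

# grad returns a new function computing dloss/dw (the first argument),
# and it keeps working unchanged if we alter the network architecture.
dloss_dw = grad(loss)

w = jnp.array([0.5, -0.5])
x = jnp.array([1.0, 2.0])
print(dloss_dw(w, x, 1.0))  # derivatives, no hand-written calculus
```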

If you really want to understand how useful this abstracted automatic differentiation process is, try making a multilayer neural network with half a dozen nodes and writing the code to implement backpropagation (if anyone has the patience and grit to do this, kudos to you).

More Complex Networks

Having a network with two nodes is not particularly useful for most applications. Typically, we use neural networks to approximate complex functions that cannot be easily described by traditional methods.

Neural networks are special because they obey something called the universal approximation theorem. This theorem states that, given a sufficiently large number of neurons in a neural network, an arbitrarily complex continuous function can be approximated to any desired accuracy. This is quite a profound statement, as it means that, given enough computational power, we can approximate essentially any function.

Obviously, in practice, there are several issues with this idea. Firstly, we are limited by the data we have available to us, which limits our potential accuracy in predicting categories or estimating values. Secondly, we are limited by our computational power. It is fairly easy to design a network that far exceeds the capabilities of even the most powerful supercomputers in the world.

The trick is to design a network architecture such that we are able to achieve high accuracy using relatively little computational power, with minimal data.

What is even more impressive is that one hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy.

So why do people use multilayer neural networks if one layer is enough?

A neural architecture with multiple hidden layers.

The answer is simple: such a network would need to be very wide, since shallow networks require (exponentially) more width than deep networks to achieve the same expressiveness. Furthermore, shallow networks are more prone to overfitting.

This is the impetus behind the field of deep learning (deep referring to the multiple layers of a neural network), which dominates contemporary research literature in machine learning and most fields involving data classification and prediction.

Summary

This article discussed the motivation and background surrounding neural networks and outlined how they can be trained. We talked about loss functions, error propagation, activation functions, and network architectures. The diagram below provides a great summary of all of the concepts discussed and how they are interconnected.

Neural networks step-by-step.

The knowledge from this article will provide us with a strong basis from which we can build upon in future articles discussing how to improve the performance of neural networks and use them for deep learning applications.

Newsletter

For updates on new blog posts and extra content, sign up for my newsletter.

