A Soft Introduction to Neural Networks

Shubhang Desai
Towards Data Science
9 min read · Jun 21, 2017


Over the last few years, neural networks have become synonymous with Machine Learning. Recently, we have been able to make neural nets that can produce life-like faces, transfer artistic styles, and even “age” a picture of a person by years. If you want to understand the intuition behind how neural networks work but have no idea where to start, you’ve come to the right place!

By the end of this tutorial, you can expect to understand the basics of how and why neural networks work. You will also create a very simple neural network from scratch using only NumPy. Experience with linear algebra, multivariate calculus, and Python programming is a prerequisite for this tutorial.

The Forward Pass

The job of any neural network is to generate a prediction from some input. This sounds vague, but that’s because the task of every neural network is slightly different, so the definition has to be one-size-fits-all. Some examples of inputs might be an image in the form of a three-dimensional tensor, a feature vector of properties, or a vector of dictionary embeddings; some examples of outputs might be a predicted class label (classification) or a predicted continuous value (regression).

Give Me the Data

Let’s take a simple example to make this concept less vague. We will use a toy dataset of sixteen data points, each with four features and one of two possible class labels (0 or 1). Here is the dataset in code:

import numpy as np

X = np.array([[1, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1]])
y = np.array([[0], [0], [0], [1], [1], [1], [0], [1], [1], [0], [1], [0], [1], [1], [0], [0]])

As you can see, the class label is exactly the same as the value of the third feature. We want the network to accurately transform the features into predicted class probabilities; that is, the output should contain a probability for each of the two possible classes. If our network is accurate, it will assign a high predicted probability to the correct class and a low predicted probability to the incorrect class.

One important note before we continue: we can’t use all of X to train our network. This is because we expect our network to do well on data it has seen; we are really interested in how it does on data it hasn’t seen. We want two kinds of unseen data: a dataset on which we can periodically evaluate our network, known as the validation set, and a dataset that simulates “real world data” and is evaluated only once, known as the test set. The rest of the data, used for training, is known as the training set. Here’s the code to perform that split:

X_train = X[:8, :]
X_val = X[8:12, :]
X_test = X[12:16, :]

y_train = y[:8]
y_val = y[8:12]
y_test = y[12:16]

Network Layers

The word “layer” gets thrown around in the context of Machine Learning quite a bit. Truth be told, a layer is nothing more than a mathematical operation, one that often involves a multiplicative weight matrix and an additive bias vector.

Let’s take just one of the data points, which is a set of four feature values. We want to turn this input data point, a matrix of dimension 1 by 4, into a vector of label probabilities, a vector of dimension 1 by 2. To do this, we simply multiply the input by a 4 by 2 weight matrix. That’s it! Well, not quite: we also add a 1 by 2 bias vector to the product, just to give the network an additional parameter to learn.

Here’s the code:

W = np.random.randn(4, 2)
b = np.zeros(2)

linear_cache = {}
def linear(input):
    output = np.matmul(input, W) + b
    linear_cache["input"] = input
    return output

(Don’t worry about the cache for now; that will be explained in the next section)

A really cool result of the way this operation works is that we don’t need to give the layer just one data point as an input! If we take, say, four data points instead, the input is 4 by 4 and the output is 4 by 2 (four sets of class probabilities). A subset of your dataset used at train-time is called a batch.

The class with the highest probability is the network’s guess. Great! Except that the network would be horribly wrong — all the weights are random, after all! Said concisely: now that we have the network guessing at all, we need to get it guessing correctly.
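For instance, here is a quick sanity check (just a sketch using the layer and data defined above) that pushes a batch of four training points through the layer and reads off the network’s guesses with np.argmax. We haven’t turned the scores into probabilities yet, but softmax preserves the ordering, so the class with the largest raw score is the same as the class with the highest probability:

# Forward a batch of four data points through the (still untrained) layer
batch = X_train[:4, :]                   # shape (4, 4): four data points, four features each
scores = linear(batch)                   # shape (4, 2): one row of class scores per data point
guesses = np.argmax(scores, axis=1)      # index of the larger score in each row
print(scores.shape, guesses)             # e.g. (4, 2) [0 1 0 1]; essentially random for now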

The Backward Pass

To make the network’s guesses better, we need to find “good” values for the weight and bias so that the number of correct guesses is maximized. We do this by finding one scalar value that represents how “wrong” our guess was, known as the loss value, and minimizing it using multivariate calculus.

The Loss Function (or, “Taking an L”)

Before we compute a scalar loss, we want to turn our arbitrarily valued scores into a proper probability distribution. We do this by computing the softmax function over the values:

p_j = e^(f_j) / Σ_k e^(f_k)

for every value f_j in the linear output. To get the scalar loss value, we compute the cross-entropy of the correct class:

L = -log(p_i)

where i is the index of the correct class. Here’s the code:

softmax_cache = {}
def softmax_cross_entropy(input, y):
    batch_size = input.shape[0]
    indeces = np.arange(batch_size)
    y = y.flatten()

    # Softmax: exponentiate the scores and normalize each row into a probability distribution
    exp = np.exp(input)
    norm = (exp.T / np.sum(exp, axis=1)).T
    softmax_cache["norm"], softmax_cache["y"], softmax_cache["indeces"] = norm, y, indeces

    # Cross-entropy: negative log-probability of the correct class, averaged over the batch
    losses = -np.log(norm[indeces, y])
    return np.sum(losses) / batch_size
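As a quick sanity check (a sketch, not part of the training code), the loss of the untrained network on a small batch should land in the same ballpark as -log(0.5) ≈ 0.69, since random weights give roughly 50/50 guesses between the two classes; the exact value will vary from run to run because the weights are random:

# With random weights, the average loss should be roughly -log(0.5) ≈ 0.69
scores = linear(X_train[:4, :])
loss = softmax_cross_entropy(scores, y_train[:4])
print(loss)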

Backpropagation

We are now in the business of minimizing the loss value, and we need to change the weights to do this. We use the beauty of the chain rule to accomplish this.

Let’s take a super simple example to demonstrate this concept. Let’s define a function: L(x) = 3x² + 2. We can draw this function as a computational graph, in which x flows through the subfunctions n², 3n, and n + 2 to produce L.

We want to take the derivative of L with respect to x. We can do this by treating each graph node as a subfunction, taking the partial derivative of each subfunction, and multiplying by the incoming derivative.

Let’s break this down. The derivative of L with respect to L is just 1. The derivative of n + 2 is 1; multiplied by 1 (the incoming gradient) is 1. The derivative of 3n is 3; multiplied by 1 is 3. And finally, the derivative of n² is 2n; multiplied by 3 is 6n. Indeed, we know that the derivative of 3x² + 2 is 6x. Finding the derivative with respect to the parameters by recursive use of the chain rule is called backpropagation.
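If you want to convince yourself numerically, a quick finite-difference check (just a sketch, unrelated to the network code; toy_L is an illustrative helper) agrees with the analytic answer 6x:

# Numerically check that the derivative of L(x) = 3x^2 + 2 is 6x
def toy_L(x):
    return 3 * x**2 + 2

x, h = 2.0, 1e-5
numeric = (toy_L(x + h) - toy_L(x - h)) / (2 * h)   # centered finite difference
analytic = 6 * x
print(numeric, analytic)                            # both come out to about 12.0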

We can do the same thing with our neural network by first drawing it out as a graph: the input x is multiplied by the weights W, the bias b is added to give the scores q, and q is fed through the softmax/cross-entropy function to produce the loss. We can then backpropagate through this graph to find the derivative (known as a gradient when the variable is non-scalar) of the loss with respect to the weight and bias.

The backpropagation through this graph follows the same underlying rules as in the graph above. However, there’s still quite a bit going on here, so let’s again break it down. The gradient flowing out of the softmax/cross-entropy function states that the derivative with respect to the scores q is the same as the softmax output from the forward pass, except that we subtract 1 at the correct class. The derivation of this gradient is out of the scope of this article, but you can read more about it here. Also, b is not the same dimension as q, so we need to sum across the batch dimension to make sure the dimensions match. Finally, x^T is the transpose of the input matrix X. Hopefully it is now clear why we needed to cache certain variables: some values used or computed in the forward pass are needed to compute gradients in the backward pass. Here is the code for the backward passes:

def softmax_cross_entropy_backward():
    norm, y, indeces = softmax_cache["norm"], softmax_cache["y"], softmax_cache["indeces"]
    # Gradient with respect to the scores: the softmax output, minus 1 at the correct class.
    # Note that this is the gradient of the summed (not averaged) loss; the constant
    # 1/batch_size factor is effectively absorbed into the learning rate.
    dloss = norm
    dloss[indeces, y] -= 1
    return dloss

def linear_backward(dout):
    input = linear_cache["input"]
    dW = np.matmul(input.T, dout)
    db = np.sum(dout, axis=0)   # sum across the batch dimension so db matches b's shape
    return dW, db
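A good way to gain confidence in these backward passes is a numerical gradient check: nudge a single weight by a tiny amount, recompute the loss, and compare the finite-difference slope to the analytic gradient. Here is a rough sketch (not part of the training code); since the backward pass above returns the gradient of the summed loss while the forward pass returns the mean, we divide by the batch size to compare like with like:

# Numerical gradient check for a single entry of W (a sketch)
eps = 1e-5
batch, target = X_train[:4, :], y_train[:4]
batch_size = batch.shape[0]

# Analytic gradient of the mean loss with respect to W[0, 0]
softmax_cross_entropy(linear(batch), target)
dW, db = linear_backward(softmax_cross_entropy_backward())
analytic = dW[0, 0] / batch_size

# Numeric gradient for the same entry via centered finite differences
W[0, 0] += eps
loss_plus = softmax_cross_entropy(linear(batch), target)
W[0, 0] -= 2 * eps
loss_minus = softmax_cross_entropy(linear(batch), target)
W[0, 0] += eps                                   # restore the original weight
numeric = (loss_plus - loss_minus) / (2 * eps)

print(analytic, numeric)                         # should agree to several decimal places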

Update Rules

Backpropagation with respect to the parameters gives us the steepest direction of change. So, if we move in the opposite direction, we will reduce the value of the function. The simplest algorithm for moving in the direction of steepest decrease is called gradient descent: multiply the gradient by some value alpha and subtract it from the parameter:

W = W - alpha * dL/dW   (and likewise for b)

The multiplicative value alpha is very important: if it’s too large, we may shoot past the minimum, but if it’s too small, we may never converge. The size of the step that we take in a weight update is known as the learning rate. The learning rate is a hyperparameter, a value that we can vary to yield different results in our trained network.
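To get a feel for this, here is a tiny sketch of gradient descent on the toy function L(x) = 3x² + 2 from earlier, whose derivative we found to be 6x. With alpha = 0.1 the iterates slide down to the minimum at x = 0; with something like alpha = 0.4 they overshoot and diverge:

# Gradient descent on L(x) = 3x^2 + 2; the minimum is at x = 0
x = 5.0
alpha = 0.1                  # learning rate; try 0.4 to see the updates overshoot and blow up
for step in range(20):
    grad = 6 * x             # dL/dx
    x -= alpha * grad        # step against the gradient
print(x)                     # very close to 0 after 20 steps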

Note the section commented “Weight updates” in the code for our training regime:

def eval_accuracy(output, target):
    pred = np.argmax(output, axis=1)
    target = np.reshape(target, (target.shape[0]))
    correct = np.sum(pred == target)
    accuracy = correct / pred.shape[0] * 100
    return accuracy

# Training regime
for i in range(4000):
    indeces = np.random.choice(X_train.shape[0], 4)
    batch = X_train[indeces, :]
    target = y_train[indeces]

    # Forward Pass
    linear_output = linear(batch)
    loss = softmax_cross_entropy(linear_output, target)

    # Backward Pass
    dloss = softmax_cross_entropy_backward()
    dW, db = linear_backward(dloss)

    # Weight updates
    W -= 1e-2 * dW
    b -= 1e-2 * db

    # Evaluation
    if (i+1) % 100 == 0:
        accuracy = eval_accuracy(linear_output, target)
        print("Training Accuracy: %f" % accuracy)

    if (i+1) % 500 == 0:
        accuracy = eval_accuracy(linear(X_val), y_val)
        print("Validation Accuracy: %f" % accuracy)

# Test evaluation
accuracy = eval_accuracy(linear(X_test), y_test)
print("Test Accuracy: %f" % accuracy)

The complete code is available as a Gist here: https://gist.github.com/ShubhangDesai/72023174e0d54f8fb60ed87a3a58ec7c

That’s it! We have a very simple, single-layer neural network that can be trained to reach 100% validation and test accuracy on our toy dataset.

Next Steps

The type of layer that we have used is called a linear or fully-connected layer. I’ll be writing more articles this summer about other types of layers and network architectures, as well as articles on way cooler applications than toy datasets. Be on the lookout for those!

There are some awesome online courses on Machine Learning available for free. The Coursera ML course is a classic, of course, but I’d also recommend the course materials for Stanford CS 231n. It is a Master’s-level course which I had the privilege of taking last quarter; it is intensive and incredibly well taught.

I would also recommend looking into some of the beginner tutorials for TensorFlow and PyTorch. They’re the most popular open-source deep learning libraries, and rightfully so. The tutorials are in-depth and easy to follow.

Now that you have the basics of neural networks down, you have been officially inducted into an exciting and fast-changing field. Go, explore the field! I really hope that Machine Learning inspires as much awe and amazement in you as it has in me.

If you liked this article, please be sure to give me a clap and follow me to see my future articles in your feed! Also, check out my personal blog and follow my Twitter for more of my musings.
