
Multi-Layer Perceptrons, Explained and Illustrated

Get insight into the first fully functional model of neural networks

In the previous article we talked about perceptrons as one of the earliest models of neural networks. As we have seen, single perceptrons are limited in their computational power since they can solve only linearly separable problems.

In this article we will discuss multi-layer perceptrons (MLPs), which are networks consisting of multiple layers of perceptrons and are much more powerful than single-layer perceptrons. We will see how these networks operate and how to use them to solve complex tasks such as image classification.

Definitions and Notations

A multi-layer perceptron (MLP) is a neural network that has at least three layers: an input layer, a hidden layer, and an output layer. Each layer operates on the outputs of its preceding layer:

The MLP architecture

We will use the following notations:

  • aᵢˡ is the activation (output) of neuron i in layer l
  • wᵢⱼˡ is the weight of the connection from neuron j in layer l-1 to neuron i in layer l
  • bᵢˡ is the bias term of neuron i in layer l

The intermediate layers between the input and the output are called hidden layers since they are not visible outside of the network (they form the "internal brain" of the network).

The input layer is typically not counted in the number of layers in the network. For example, a 3-layer network has one input layer, two hidden layers, and an output layer.

Forward Propagation

Forward propagation is the process where the input data is fed through the network in a forward direction, layer-by-layer, until it generates the output.

The activations of the neurons during the forward propagation phase are computed similarly to how the activation of a single perceptron is computed.

For example, let’s look at neuron i in layer l. The activation of this neuron is computed in two steps:

  1. We first compute the net input of the neuron (denoted zᵢˡ) as the weighted sum of its incoming inputs plus its bias:

     zᵢˡ = Σⱼ wᵢⱼˡ aⱼˡ⁻¹ + bᵢˡ

  2. We now apply the activation function to the net input to get the neuron’s activation:

     aᵢˡ = f(zᵢˡ)

By definition, the activations of the neurons in the input layer are equal to the feature values of the example currently presented to the network, i.e.,

aᵢ⁰ = xᵢ  for i = 1, …, m

where m is the number of features in the data set.

Vectorized Form

To make the computations more efficient (especially when using numerical libraries like NumPy), we typically use the vectorized form of the above equations.

We first define the vector aˡ as the vector containing the activations of all the neurons in layer l, and the vector bˡ as the vector with the biases of all the neurons in layer l.

We also define Wˡ as the matrix of the connection weights from all the neurons in layer l – 1 to all the neurons in layer l. For example, w₂₃¹ is the weight of the connection from neuron no. 3 in layer 0 (the input layer) to neuron no. 2 in layer 1 (the first hidden layer).

We can now write the forward propagation equations in vector form. For each layer l we compute:

zˡ = Wˡaˡ⁻¹ + bˡ
aˡ = f(zˡ)
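
To illustrate, here is a minimal NumPy sketch of this vectorized forward pass (the step function and the layer dimensions are just placeholders; a trained network would have learned weights and typically a different activation function):

import numpy as np

def step(z):
    # Step activation: 1 for non-negative inputs, 0 otherwise
    return (z >= 0).astype(float)

def forward(x, weights, biases, f=step):
    # Propagate the input vector x through the network layer by layer:
    # weights[l] and biases[l] hold W and b of layer l+1 (the input layer has none)
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b   # net input of the layer
        a = f(z)        # activation of the layer
    return a            # activations of the output layer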

Solving the XOR Problem

One of the first demonstrations of the power of MLPs over single perceptrons was showing that they can solve the XOR problem. The XOR problem is not linearly separable, so a single perceptron cannot solve it:

The XOR problem

However, an MLP with a single hidden layer can easily solve this problem:

An MLP that solves the XOR problem. The bias terms are written inside the nodes.

Let’s analyze how this MLP works. The MLP has three hidden neurons and one output neuron in addition to the two input neurons. We assume here that all the neurons use the step activation function (i.e., the function whose value is 1 for all non-negative inputs, and 0 for all negative inputs).

The top hidden neuron is connected only to the first input _x_₁ with a connection weight of 1, and it has a bias of -1. Therefore, this neuron fires only when _x_₁ = 1 (in which case its net input is 1 × 1 + (-1) = 0, and f(0) = 1, where f is the step function).

The middle hidden neuron is connected to both inputs with connection weights of 1, and it has a bias of -2. Therefore, this neuron fires only when both inputs are 1.

The bottom hidden neuron is connected only to the second input _x_₂ with a connection weight of 1, and it has a bias of -1. Therefore, this neuron fires only when _x_₂ = 1.

The output neuron is connected to the top and bottom hidden neurons with a weight of 1 and to the middle hidden neuron with a weight of -2, and it has a bias of -1. Therefore, it fires only when the top or the bottom hidden neuron fires, but not when both of them fire together. In other words, it fires only when _x_₁ = 1 or _x_₂ = 1, but not when both inputs are 1, which is exactly what we expect the output of the XOR function to be.

For example, let’s compute the forward propagation of this MLP for the inputs _x_₁ = 1 and _x_₂ = 0. The activations of the hidden neurons in this case are:

top hidden neuron:    f(1 · 1 − 1) = f(0) = 1
middle hidden neuron: f(1 · 1 + 1 · 0 − 2) = f(−1) = 0
bottom hidden neuron: f(1 · 0 − 1) = f(−1) = 0

We can see that only the top hidden neuron fires in this case.

The activation of the output neuron is therefore:

f(1 · 1 + (−2) · 0 + 1 · 0 − 1) = f(0) = 1

The output neuron fires in this case, which is what we expect the output of XOR to be for the inputs _x_₁ = 1 and _x_₂ = 0.

Verify that you understand how the MLP computes the other three cases of the XOR function as well!
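
As a quick check, here is a short NumPy sketch that implements the forward pass of this XOR network, using the weights and biases described above, and evaluates it on all four input combinations:

import numpy as np

def step(z):
    return (z >= 0).astype(float)

# Hidden layer: rows correspond to the top, middle and bottom neurons
W1 = np.array([[1., 0.],    # top neuron: connected only to x1
               [1., 1.],    # middle neuron: connected to both inputs
               [0., 1.]])   # bottom neuron: connected only to x2
b1 = np.array([-1., -2., -1.])

# Output neuron: weights to the top, middle and bottom hidden neurons
W2 = np.array([[1., -2., 1.]])
b2 = np.array([-1.])

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array([x1, x2]) + b1)   # hidden activations
    y = step(W2 @ h + b2)                    # output activation
    print(f'{x1} XOR {x2} = {int(y[0])}')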

MLP Construction Exercise

As another example, consider the following data set that contains points from three different classes:

Build an MLP that correctly classifies all the points in this data set. Hint: Use the hidden neurons to identify the three classification areas.

The solution can be found at the bottom of this article.

The Universal Approximation Theorem

One of the remarkable facts about MLPs is that they can approximate virtually any function (even though each neuron in the network computes a very simple function such as the step function).

The universal approximation theorem states that an MLP with one hidden layer (with a sufficient number of neurons) can approximate any continuous function of the inputs arbitrarily well. With two hidden layers, it can even approximate discontinuous functions. This means that even very simple network architectures can be extremely powerful.

Unfortunately, the proof of the theorem is non-constructive, i.e., it does not tell us how to build a network to compute a specific function but only shows that such a network exists.

Learning in MLPs: Backpropagation

Although MLPs have proven to be computationally powerful, for a long time it was not clear how to train them on a specific data set. While single perceptrons have a simple weight update rule, it was not clear how to apply this rule to the weights of the hidden layers, since these do not directly affect the output of the network (and hence its training loss).

It took the AI community more than 30 years to solve this problem when in 1986 Rumelhart et al. introduced their groundbreaking backpropagation algorithm for training MLPs.

The main idea of backpropagation is to first compute the gradients of the error function of the network with respect to each one of its weights, and then use gradient descent to minimize the error. It is called backpropagation because we propagate the gradients of the error from the output layer back to the input layer using the chain rule of derivatives.

The backpropagation algorithm is thoroughly explained in this article.

Activation Functions

In single-layer perceptrons we used either the step or the sign function as the neuron’s activation. The issue with these functions is that their gradient is 0 almost everywhere (they are equal to a constant value for x > 0 and for x < 0). This means that we cannot use them in gradient descent to find the weights that minimize the network’s error.

Therefore, in MLPs we need to use other activation functions. These functions should be both differentiable and non-linear (if all the neurons in an MLP use a linear activation function then the MLP behaves like a single-layer perceptron).

For the hidden layers, the three most common activation functions are (sketched in code below):

  1. The sigmoid function
  2. The hyperbolic tangent function
  3. The ReLU (rectified linear unit) function
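
For reference, here is a minimal NumPy sketch of these three functions (Scikit-Learn and the deep learning libraries provide their own implementations, so in practice you would not write these yourself):

import numpy as np

def sigmoid(z):
    # Squashes its input into the range (0, 1)
    return 1. / (1. + np.exp(-z))

def tanh(z):
    # Squashes its input into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Passes positive inputs through unchanged and zeroes out negative ones
    return np.maximum(0., z)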

The activation function in the output layer depends on the problem the network is trying to solve:

  1. For regression problems we use the identity function f(x) = x.
  2. For binary classification problems we use the sigmoid function (shown above).
  3. For multi-class classification problems we use the softmax function, which converts a vector of k real numbers into a probability distribution of k possible outcomes:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ),  for i = 1, …, k
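
Here is a minimal NumPy sketch of the softmax function (subtracting the maximum before exponentiating is a common numerical-stability trick, not part of the mathematical definition):

import numpy as np

def softmax(z):
    # Convert a vector of k real numbers into a probability distribution
    exp_z = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return exp_z / exp_z.sum()

print(softmax(np.array([1., 2., 3.])))   # [0.09003057 0.24472847 0.66524096]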

The reason why we use the softmax function in multi-class problems is explained in depth in this article:

Deep Dive into Softmax Regression

MLPs in Scikit-Learn

Scikit-Learn provides two classes that implement MLPs in the sklearn.neural_network module:

  1. MLPClassifier is used for classification problems.
  2. MLPRegressor is used for regression problems.

The important hyperparameters in these classes are:

  • hidden_layer_sizes – a tuple that defines the number of neurons in each hidden layer. The default is (100,), i.e., a single hidden layer with 100 neurons. For many problems, using just one or two hidden layers should be enough. For more complex problems, you can gradually increase the number of hidden layers, until the network starts overfitting the training set.
  • activation – the activation function to use in the hidden layers. The options are ‘identity’, ‘logistic’, ‘tanh’, and ‘relu’ (the default).
  • solver – the solver to use for the weight optimization. The default is ‘adam’, which works well on most data sets. The behavior of the various optimizers will be explained in a future article.
  • alpha – the L2 regularization coefficient (defaults to 0.0001).
  • batch_size – the size of the mini-batches used for training (defaults to ‘auto’, i.e., min(200, n_samples)).
  • learning_rate – the learning rate schedule for the weight updates (defaults to ‘constant’).
  • learning_rate_init – the initial learning rate used (defaults to 0.001).
  • early_stopping – whether to stop the training when the validation score is not improving (defaults to False).
  • validation_fraction – the proportion of the training set to set aside for validation (defaults to 0.1).

We normally use grid search and cross-validation to tune these hyperparameters.
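
For example, a minimal sketch of tuning two of these hyperparameters with GridSearchCV might look as follows (the parameter grid and max_iter value here are purely illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# An illustrative grid; in practice, choose ranges that fit your data set and compute budget
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'alpha': [0.0001, 0.001, 0.01],
}

grid = GridSearchCV(MLPClassifier(max_iter=300), param_grid, cv=3, n_jobs=-1)
# grid.fit(X_train, y_train)
# print(grid.best_params_)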

Training an MLP on MNIST

For example, let’s train an MLP on the MNIST data set, which is a widely used data set for image classification tasks.

The data set contains 60,000 training images and 10,000 testing images of handwritten digits. Each image is 28 × 28 pixels in size, and is typically represented by a vector of 784 numbers in the range [0, 255]. The task is to classify these images into one of the ten digits (0–9).

We first fetch the MNIST data set using the fetch_openml() function:

from sklearn.datasets import fetch_openml

X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False)

The as_frame parameter specifies that we want to get the data and the labels as NumPy arrays instead of DataFrames (the default of this parameter changed in Scikit-Learn 0.24 from False to ‘auto’).

Let’s examine the shape of X:

print(X.shape)
(70000, 784)

That is, X consists of 70,000 flat vectors of 784 pixels.

Let’s display the first 50 digits in the data set:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(5, 10, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X[i].reshape(28, 28), cmap='binary')
    ax.axis('off')
The first 50 digits from the MNIST data set

Let’s check how many samples we have from each digit:

import numpy as np

np.unique(y, return_counts=True)
(array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object),
 array([6903, 7877, 6990, 7141, 6824, 6313, 6876, 7293, 6825, 6958],
       dtype=int64))

The data set is fairly balanced between the 10 classes.

We now scale the inputs to be within the range [0, 1] instead of [0, 255]:

X = X / 255

Feature scaling makes the training of neural networks faster and less likely to get stuck in poor local optima.

We now split the data into training and test sets. Note that the first 60,000 images in MNIST are already designated for training, so we can just use simple slicing for the split:

train_size = 60000
X_train, y_train = X[:train_size], y[:train_size]
X_test, y_test = X[train_size:], y[train_size:]

We now create an MLP classifier with a single hidden layer with 300 neurons. We will keep all the other hyperparameters with their default values, except for early_stopping which we will change to True. We will also set verbose=True in order to track the progress of the training:

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(300,), early_stopping=True, 
                    verbose=True)

Let’s fit the classifier to the training set:

mlp.fit(X_train, y_train)

The output we get during training is:

Iteration 1, loss = 0.35415292
Validation score: 0.950167
Iteration 2, loss = 0.15504686
Validation score: 0.964833
Iteration 3, loss = 0.10840875
Validation score: 0.969833
Iteration 4, loss = 0.08041958
Validation score: 0.972333
Iteration 5, loss = 0.06253450
Validation score: 0.973167
...
Iteration 31, loss = 0.00285821
Validation score: 0.980500
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.

The training stopped after 31 iterations, since the validation score had not improved during the previous 10 iterations.

Let’s check the accuracy of the MLP on the training and the test sets:

print('Accuracy on training set:', mlp.score(X_train, y_train))
print('Accuracy on test set:', mlp.score(X_test, y_test))
Accuracy on training set: 0.998
Accuracy on test set: 0.9795

These are great results, but networks with more complex architectures, such as convolutional neural networks (CNNs), can achieve even better results on this data set (up to 99.91% accuracy on the test set!). You can find the state-of-the-art results on MNIST with links to the relevant papers here.

To better understand the errors of our model, let’s display its confusion matrix:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = mlp.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=mlp.classes_)
disp.plot(cmap='Blues')
The confusion matrix on the test set

We can see that the main confusions of the model are between the digits 4⇔9, 7⇔9 and 2⇔8. This makes sense, since these digits often resemble each other when written by hand. To help our model distinguish between these digits, we can add more examples of these digits (e.g., by using data augmentation, as sketched below) or extract additional features from the images (e.g., the number of closed loops in the digit).
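
For example, a simple form of data augmentation is to add copies of each training image shifted by one pixel in each direction. Here is a minimal sketch using scipy.ndimage.shift (assuming SciPy is installed; the helper function shifted_copies is my own illustrative name):

import numpy as np
from scipy.ndimage import shift

def shifted_copies(image, label):
    # Return four copies of a flattened 28x28 image, each shifted by one pixel
    img = image.reshape(28, 28)
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # down, up, right, left
    copies = [shift(img, s, cval=0).reshape(784) for s in shifts]
    return np.array(copies), np.full(4, label)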

Visualizing the MLP Weights

Although neural networks are generally considered to be "black-box" models, in simple networks that consist of one or two hidden layers, we can visualize the learned weights and occasionally gain some insight into how these networks work internally.

For example, we can plot the weights between the input and the hidden layer of our MLP classifier. This weight matrix has a shape of (784, 300), and is stored in mlp.coefs_[0]:

print(mlp.coefs_[0].shape)
(784, 300)

Column i of this matrix represents the weights of the incoming inputs to hidden neuron i. We can display this column as a 28 × 28 pixel image, in order to examine which input neurons have a stronger influence on this neuron’s activation.

The following plot displays the weights of the first 20 hidden neurons:

fig, axes = plt.subplots(4, 5)

for coef, ax in zip(mlp.coefs_[0].T, axes.flat):
    im = ax.imshow(coef.reshape(28, 28), cmap='gray')
    ax.axis('off')

fig.colorbar(im, ax=axes.flat)
The weights of the first 20 hidden neurons

We can see that each hidden neuron focuses on different segments of the image.

MLPs in Other Libraries

Although the MLP classifier in Scikit-Learn is easy to use, in practical applications you are more likely to use a Deep Learning library such as TensorFlow or PyTorch to build MLPs. These libraries can take advantage of faster GPU processing and they also provide many additional options, such as additional activation functions and optimizers. You can find an example of how to use these libraries in this post.


Solution to the MLP Construction Exercise

The following MLP classifies correctly all the points in the data set:

MLP for solving the classification problem. The bias terms are written inside the nodes.

Explanation:

The left hidden neuron fires only when _x_₁ ≤ 3, the middle hidden neuron fires only when _x_₂ ≥ 4, and the right hidden neuron fires only when _x_₂ ≤ 0.

The left output neuron performs an OR between the left and the middle hidden neurons, therefore it fires only if _x_₁ ≤ 3 OR _x_₂ ≥ 4, i.e., only when the point is blue.

The middle output neuron performs a NOR (Not OR) between all the hidden neurons, therefore it fires only when NOT (_x_₁ ≤ 3 OR _x_₂ ≥ 4 OR _x_₂ ≤ 0). In other words, it fires only when _x_₁ > 3 AND 0 < _x_₂ < 4, i.e., only when the point is red.

The right output neuron fires only when the right hidden neuron fires, i.e., only when _x_₂ ≤ 0, which is true only for the purple points.
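
To make this concrete, here is one possible set of weights and biases that matches the description above (the exact values in the figure may differ); each of the three output neurons corresponds to one class:

import numpy as np

def step(z):
    return (z >= 0).astype(float)

# Hidden neurons fire when x1 <= 3, x2 >= 4 and x2 <= 0, respectively
W1 = np.array([[-1., 0.],
               [ 0., 1.],
               [ 0., -1.]])
b1 = np.array([3., -4., 0.])

# Output neurons: OR(left, middle), NOR(all three), right hidden only
W2 = np.array([[ 1.,  1.,  0.],
               [-1., -1., -1.],
               [ 0.,  0.,  1.]])
b2 = np.array([-1., 0., -1.])

def classify(x1, x2):
    h = step(W1 @ np.array([x1, x2]) + b1)
    return step(W2 @ h + b2)   # activations of the blue, red and purple output neurons

print(classify(5, 2))   # a point with x1 > 3 and 0 < x2 < 4: [0. 1. 0.] (red)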

Final Notes

You can find the code examples of this article on my GitHub: https://github.com/roiyeho/medium/tree/main/mlp

All images unless otherwise noted are by the author.

MNIST Dataset Info:

  • Citation: Deng, L., 2012. The mnist database of handwritten digit images for Machine Learning research. IEEE Signal Processing Magazine, 29(6), pp. 141–142.
  • License: Yann LeCun and Corinna Cortes hold the copyright of the MNIST dataset which is available under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA).

Thanks for reading!

