
Terribly Perceptive: The perceptron & the XOR

How a seemingly innocuous problem ushered in the first AI winter. (and how to build your own perceptron)

Image by Ross Stone on Unsplash.

Though it may be meta of post-modern proportions to say so, it is not unlikely that you were led to this article by a machine learning algorithm of one stripe or another. ML has reached almost every facet of our lives – from recommending movies to detecting credit card fraud – whether or not we are aware of it. The last decade has seen record-setting hype for ML and data science in general, with over-priced online courses now being advertised even by major academic institutions. To an outsider (or beginner) it may seem as though ML is no older than the hype surrounding it, but they would be mistaken. The broad and perhaps somewhat euphemistic term "Machine Learning" encapsulates a large variety of applied statistical tools, many of which are actually quite old. Neural networks in particular are older than most people might think: these mathematical approximations of neural machinations were first devised in the 1950s.

So why all this hype now? That is a good question – the answer to which is multi-dimensional. However, it is not the question I will concern myself with here. Instead, I want to dig into why we ought to manage our expectations for ML in the future, by looking at a case study that may seem ancient to some: the problem of the perceptron and the XOR.

Apart from sounding like a Sherlock Holmes short story or a novel by J.K. Rowling, it is a prime example of how a seemingly innocuous problem can bring an entire field to a halt. While time is linear and history merely rhymes, the seasons truly are cyclical – and the "AI winter" ushered in in 1969 by Minsky and Papert is unlikely to have been the last.

(code on GitHub)


The Perceptron

Fundamentals

What is the Perceptron? The perceptron is generally regarded as the first formal neural net and was proposed by Frank Rosenblatt in 1958. Its structure and the underlying maths are relatively straightforward, so let’s take some time to explore them.

The perceptron consists of two layers: an input layer and an output layer (fig. 1). The input layer can take N units – i.e. an N-dimensional vector – and the output layer can return a T-dimensional vector of activations. Mathematically, the perceptron is not much more than some relatively simple linear algebra and vector calculus.

Figure 1: A simple perceptron with two inputs, x1 and x2, and a single output unit, z, with activation y. Each input has an associated weight, w1 and w2 respectively. Additionally, I treat the threshold (or the bias) as another input x0 (always = -1) with its respective weight w0. Image by author.

For example, let’s imagine we are a musician – a guitarist, to be precise. We are new to playing guitar and want to learn some songs to impress our crush. We know that they enjoy reggae music, so we download 1024 guitar tabs from some website – but, being our usual messy selves, we include an unknown number of heavy metal songs.

While we are very messy, we are also very clever. We realise that one way to differentiate the genres using tabs alone is to check whether upward or downward strokes predominate in a randomly picked section of the song. Reggae – with its laid-back and mellow attitude – is characterised in part by staccato, upward guitar strokes, whereas metal’s heavy and harsh demeanour calls for consistent, pummelling downward strokes. Furthermore, metal often has very fast sections, which need to be played with an equal combination of upward and downward strokes.

In light of these realisations, we assign the strokes as follows: downward = 1; upward = -1. Then we assign the genres: metal = 1; reggae = 0. We take a 10-stroke snippet from each tab, giving us 1024 10-dimensional vectors.

The vector x^µ captures the 10 guitar strokes from a randomly selected part of song µ. In the case above, x^µ is the heart-wrenching intro to Metallica’s Master of Puppets. Now, a simple solution for finding the reggae songs would be to sum over the vectors and filter out all x^µ that return values > 0.

But this is too simple for us – and when else will we find a use for a simple perceptron? So let’s start by setting up some of the preamble:
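A minimal sketch of that preamble, assuming NumPy (the constant names here are my own choices; the script on GitHub may differ):

```python
# Preamble: imports and basic constants for the perceptron script.
import numpy as np

rng = np.random.default_rng(42)  # seeded RNG for reproducibility

N_SONGS = 1024   # number of downloaded tabs
N_STROKES = 10   # strokes per snippet: +1 = downward, -1 = upward
```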

Now let’s inspect some of the underlying maths. As can be seen in figure 1, each unit xi of the input layer is connected to the output unit z with weight wi. The value of z is the weighted sum of all the input units, i.e. the dot product between the input vector x and the weights vector w:
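In symbols (with x0 the bias unit, so the sum runs from 0):

```latex
z = \sum_{i=0}^{N} w_i x_i = \mathbf{w} \cdot \mathbf{x}
```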

The final output of the network is the result y of the transfer/activation function f(z), a.k.a. the activation of the output unit. There are a few options for the activation function – such as ReLU and tanh – although the classic choice for a simple binary classification problem such as this would be the logistic activation function:
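The logistic (sigmoid) function takes the form:

```latex
y = f(z) = \frac{1}{1 + e^{-z}}
```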

This is partly because it provides us with values between 0 and 1. Let’s define this function in our Python script:
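A straightforward NumPy version (the function name is my own choice):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```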

Additionally – unlike a step function – it has a defined, non-zero gradient, and it gives us a nice, clean expression for the derivative:
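Namely, the derivative can be written entirely in terms of the output y:

```latex
f'(z) = f(z)\left(1 - f(z)\right) = y(1 - y)
```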

which we can also define in our Python script:
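A sketch of that definition (again, the name is my own choice):

```python
def logistic_deriv(y):
    """Derivative of the logistic function, written in terms of its output y = f(z)."""
    return y * (1.0 - y)
```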

You are probably wondering what we need the derivative for. Fret not, my dear reader, all in good time.

So how do we train the network? Fundamentally, a neural net (NN) learns by iteratively changing its weights in order to optimize its output. There are two types of optimization: either by minimising the error of the output relative to some target value or by maximising the "merit” of the network output. The more intuitive and more widely used method is error minimisation via the use of some error (or loss) function E. We will be using the simple quadratic loss function:
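For n output units this reads:

```latex
E = \frac{1}{2} \sum_{k=1}^{n} \left(t_k - y_k\right)^2
```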

where n is the total number of output units, t is the training signal – i.e. the ground-truth result for the input data (reggae = 0 or metal = 1) – and y is the prediction made by the network based on the input data. Since our output will be a scalar (n = 1), we can ignore the summation. Let’s define this in Python:
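A minimal sketch for the scalar case (the function name is my own choice):

```python
def quadratic_loss(t, y):
    """Quadratic (squared-error) loss for a scalar output unit."""
    return 0.5 * (t - y) ** 2
```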

So how do we effectively minimise the output of this error function? This is where our derivative for y comes into play. The method is called gradient descent, and it can be analogised to walking down a hill at night: at each step you can tell which way is downhill, but you cannot see the valley. So, to reach the valley, you keep stepping in the most downhill direction.

In mathematical terms, we descend a gradient of the error surface. The error surface is the output of the error function at different weights. Look at figure 2 for a toy example: we have a two-dimensional input vector and two associated weights, w1 and w2. As we vary the weights, the error of the perceptron changes. We change the weights so that we reach the global minimum of the error surface.

Gradient descent works by taking the derivative of the error function w.r.t. a given weight wi and subtracting a fraction of this gradient from said weight. Let’s define these expressions:
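Putting the chain rule together (for our scalar output, so E = ½(t − y)²):

```latex
\Delta w_i = -\epsilon \frac{\partial E}{\partial w_i}
           = -\epsilon \, (y - t) \, y (1 - y) \, x_i
```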

where epsilon is the step size, i.e. a hyper-parameter which we can choose to define the rate of gradient descent (a.k.a. the learning rate). Here is the expression for the derivative in Python:
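One way to express the update in NumPy (the function name is my own choice; the gradient follows from the chain rule above):

```python
import numpy as np

def gradient_step(w, x, t, y, eps=0.1):
    """One gradient-descent step on the weight vector w.
    dE/dw_i = (y - t) * y * (1 - y) * x_i by the chain rule,
    and w_i <- w_i - eps * dE/dw_i."""
    grad = (y - t) * y * (1.0 - y) * x
    return w - eps * grad
```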

Congratulations! We have now set up all the necessary base-level maths needed for the perceptron. If you have been able to follow thus far, the next steps will be merely procedural.


Terminologies & Technicalities

Before we implement the perceptron, let’s cover a few basic terms:

  • Iteration – the presentation of a single data-point to the model
  • Epoch – the presentation of all data-points in the training set to the model
  • Training set – the user-defined subset of the total data-set used for training the model (i.e. adjusting the weights); usually 70–90% of the data-set
  • Validation set – used to track the error at each epoch during training; it is presented to the model in parallel with the training set, but the error signal it generates is not used to update the weights. Usually about 10% of the total data-set
  • Test set – used to test the quality of the model after training; usually roughly 10% of the total data-set
  • Online learning – the model weights are updated with every iteration

Finally, it is important to mention something some of you might have already noticed: the mysterious x0 and w0 (or "bias") shown in blue in figure 1. The bias unit is an additional unit in a layer which is always "on", i.e. it always takes the value -1, and it has an associated weight for its connection. It is essential for the effective convergence of the model.

I mention it here to avoid confusion during the implementation section, where you will see 11-d input and weight vectors. Now that we have all our i‘s dotted and t‘s crossed, let’s get into the implementation!

Implementation

After setting up all the functionalities, we need to build the surrounding infrastructure. We can start out by generating the list of the 10-d binary input vectors and initialising some model parameters (including the weights vector):
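A sketch of that setup, assuming NumPy (the names, the seed, and my reading of the labelling rule – metal if downward strokes dominate or the snippet is an even mix – are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)

N_SONGS, N_STROKES = 1024, 10

# Random +/-1 stroke snippets: +1 = downward, -1 = upward
X = rng.choice([-1.0, 1.0], size=(N_SONGS, N_STROKES))

# Training signal: metal (1) if downward strokes dominate or the snippet
# is an even mix (fast metal sections); reggae (0) otherwise
t = (X.sum(axis=1) >= 0).astype(float)

# Prepend the bias unit x0 = -1, giving the 11-d vectors mentioned earlier
X = np.hstack([-np.ones((N_SONGS, 1)), X])

# Initial weights drawn uniformly from [0, 1)
w = rng.uniform(0.0, 1.0, size=N_STROKES + 1)
```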

Here, I choose the initial weights randomly from a uniform distribution bounded by 0 and 1. You could, however, also initialise all weights as 1, for example. As it is a perceptron, you do not need to worry about local minima in the error surface – they do not exist.

Next, we split up the data-set into training, validation and test sets and set up the training (+validation) loop for the model:
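A readable, self-contained sketch of the split and the online-learning loop (it regenerates the data so it can run on its own; the 80/10/10 split and the variable names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- data: 1024 random 10-stroke snippets plus the bias unit x0 = -1 ---
N, D = 1024, 10
X = rng.choice([-1.0, 1.0], size=(N, D))
t = (X.sum(axis=1) >= 0).astype(float)          # 1 = metal, 0 = reggae
X = np.hstack([-np.ones((N, 1)), X])

# --- 80 / 10 / 10 split into training, validation and test sets ---
n_train, n_val = int(0.8 * N), int(0.1 * N)
X_train, t_train = X[:n_train], t[:n_train]
X_val, t_val = X[n_train:n_train + n_val], t[n_train:n_train + n_val]
X_test, t_test = X[n_train + n_val:], t[n_train + n_val:]

w = rng.uniform(0.0, 1.0, size=D + 1)
eps, n_epochs = 0.1, 500
train_err, val_err = [], []

for epoch in range(n_epochs):
    # online learning: update the weights after every training vector
    for x, target in zip(X_train, t_train):
        y = logistic(w @ x)
        w -= eps * (y - target) * y * (1.0 - y) * x
    # record the mean quadratic loss per epoch for both sets;
    # the validation error is measured but never used for updates
    y_tr = logistic(X_train @ w)
    y_va = logistic(X_val @ w)
    train_err.append(np.mean(0.5 * (t_train - y_tr) ** 2))
    val_err.append(np.mean(0.5 * (t_val - y_va) ** 2))
```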

Now, it should be clear that this can easily be optimised; however, I wrote it with maximum readability in mind.

We can now visualise the change in error for both training and validation data-sets over the user-defined number of epochs (500 here). As we can see in figure 3, the error of the network decreases for both the training and validation sets over the epochs, showing that the model generalises nicely. Unsurprisingly, the validation set returns higher errors than the training set, but it too converges towards zero as the epochs increase.

Figure 3: The change in network training and validation error over epochs. The error is the mean (per-input-vector) quadratic loss for a given epoch. The final, trained weights vector was applied to the test set; thus it has only one resultant error. Image by author.

Finally, we need to test the network:
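The test step simply applies the final weight vector once to the held-out data. Here is a self-contained sketch with a stand-in "trained" weight vector (in the real script, w, X_test and t_test come from the training code; the values below are hypothetical):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)

# Stand-in for the trained weights: zero bias weight and equal positive
# stroke weights, i.e. a scaled version of the "just sum the strokes" rule
w = np.array([0.0] + [3.0] * 10)

# Stand-in held-out test set (~10% of the 1024 snippets), bias unit included
X_test = np.hstack([-np.ones((102, 1)),
                    rng.choice([-1.0, 1.0], size=(102, 10))])
t_test = (X_test[:, 1:].sum(axis=1) >= 0).astype(float)

# Apply the final weight vector once -- hence the single test-error value
y_test = logistic(X_test @ w)
test_err = np.mean(0.5 * (t_test - y_test) ** 2)
```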

Again, in figure 3 we can see that the test set also performs very well. (The line is a single error value, because the test set is only applied once, to the final weights vector.)

We now have a fully trained, validated and tested perceptron!

The XOR

What is an XOR?

For those unfamiliar with Boolean logic, an XOR gate (a.k.a. exclusive OR) is a type of logic gate which is on if and only if exactly one of its two inputs is on:
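In code, the full truth table (using Python’s bitwise XOR operator as a stand-in for the gate):

```python
# The XOR truth table: the output is 1 iff exactly one input is 1
table = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
for a, b, out in table:
    print(f"in1={a}  in2={b}  out={out}")
```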

So what does this have to do with perceptrons? Well as it turns out, you can think of the perceptron as a gate, which takes the 2-dimensional input vectors and outputs a binary classification (on / off). Alternatively, you can think of it geometrically like so:

A geometric view of the XOR gate. Image by author.

What’s the fuss?

So what’s the problem? Why does this simple problem break the perceptron? For the answer, let us briefly revisit what the perceptron actually is – mathematically:

The perceptron, at its core, is a dot product between the weight vector w and the input vector x transformed by some activation function f(z)=y.

Writing out the dot product algebraically we get:
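With the bias written out explicitly (x0 = −1):

```latex
z = w_0 x_0 + w_1 x_1 + w_2 x_2 = w_1 x_1 + w_2 x_2 - w_0
```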

for a perceptron with 2 input units and a bias unit (x_0 = -1). Now this looks a lot like a linear equation, does it not? In fact, that is exactly what it is. A perceptron is technically not much more than a generalised, multivariate linear regression.

Hence, the perceptron solves a classification problem by finding an (N-1)-dimensional hyperplane in N-dimensional feature space. This hyperplane divides the space into two subspaces, each defining a class.

Hopefully, the issue with the XOR should now become apparent. Refer back to the geometric representation of the XOR gate. Now try to draw a straight line to effectively separate the two classes (out = 0 / 1) into two subspaces. Don’t worry, I’ve got time!


Done? How did you do? Not that easy, is it? Well, as it turns out, it actually isn’t possible. This is because the problem is not linearly separable – i.e. you cannot draw a hyperplane that creates two clean subspaces in the feature space. To do that, we would need to build a Multi-Layer Perceptron (MLP), but that is a topic for another day.

If you want to test this non-convergence yourself using the example perceptron we built above, all you need to do is change how you compute t. Redefine t as follows:
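One natural choice that generalises the XOR to N dimensions for our ±1 stroke encoding is the parity function:

```latex
t^{\mu} = \frac{1}{2}\left(1 - \prod_{i=1}^{N} x_i^{\mu}\right)
```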

where N is the dimensionality of the input vector x.

Happy coding!

