Perceptron Learning Algorithm: A Graphical Explanation Of Why It Works

Akshay L Chandra
Towards Data Science
8 min read · Aug 22, 2018


This post will discuss the famous Perceptron Learning Algorithm, originally proposed by Frank Rosenblatt in 1958, later refined and carefully analyzed by Minsky and Papert in 1969. This is a follow-up post of my previous posts on the McCulloch-Pitts neuron model and the Perceptron model.

Citation Note: The concept, the content, and the structure of this article are entirely based on Prof. Mitesh Khapra’s lecture slides and videos of the CS7015: Deep Learning course taught at IIT Madras.

Perceptron

You could just go through my previous post on the perceptron model (linked above), but I will assume that you won’t. So here goes: a perceptron is not the Sigmoid neuron we use in ANNs or any deep learning networks today.

The perceptron model is a more general computational model than the McCulloch-Pitts neuron. It takes an input, aggregates it (a weighted sum) and returns 1 only if the aggregated sum is more than some threshold, else it returns 0. Rewriting the threshold as shown above and making it a constant input with a variable weight, we end up with something like the following:
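In symbols, the rule described above looks like this, with the threshold theta folded in as a bias weight w_0 = -theta on a constant input x_0 = 1:

$$ y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge \theta \\ 0 & \text{otherwise} \end{cases} \quad\Longleftrightarrow\quad y = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i \ge 0 \\ 0 & \text{otherwise} \end{cases} $$

where $x_0 = 1$ and $w_0 = -\theta$.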

A single perceptron can only be used to implement linearly separable functions. It takes both real and boolean inputs and associates a set of weights with them, along with a bias (the threshold thing I mentioned above). We learn the weights, we get the function. Let's use a perceptron to learn an OR function.

OR Function Using A Perceptron

What’s going on above is that we defined a few conditions based on the OR function’s output for various sets of inputs (the weighted sum has to be more than or equal to 0 when the output is 1), solved for weights that satisfy those conditions, and got a line that perfectly separates the positive inputs from the negative ones.
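For instance, one set of weights that satisfies those conditions is w_0 = -1 (the bias), w_1 = 1 and w_2 = 1; the specific numbers here are just one valid choice, not the only one. A quick Python check against the OR truth table:

import numpy as np

# One choice of weights satisfying the OR conditions above (any weights
# meeting the inequalities would do just as well).
w = np.array([-1.0, 1.0, 1.0])   # w_0 is the bias weight on the constant input x_0 = 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([1, x1, x2])               # prepend x_0 = 1
    y = 1 if np.dot(w, x) >= 0 else 0       # the perceptron rule
    print(x1, x2, '->', y)                  # prints 0, 1, 1, 1, i.e. the OR function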

Doesn’t make any sense? Maybe now is the time to go through that post I was talking about. Rosenblatt also proposed a more principled way of learning these weights from a set of examples (data). Mind you that this is NOT a Sigmoid neuron and we’re not going to do any Gradient Descent.

Warming Up — Basics Of Linear Algebra

Vector

A vector can be defined in more than one way. For a physicist, a vector is anything that sits anywhere in space and has a magnitude and a direction. For a CS person, a vector is just a data structure used to store some data — integers, strings etc. For this tutorial, I would like you to imagine a vector the mathematician’s way, where a vector is an arrow in space with its tail at the origin. This is not the best mathematical way to describe a vector, but as long as you get the intuition, you’re good to go.

Note: I have borrowed the following screenshots from 3Blue1Brown’s video on Vectors. If you don’t know him already, please check his series on Linear Algebra and Calculus. He is just out of this world when it comes to visualizing Math.

Vector Representations

A 2-dimensional vector can be represented on a 2D plane as follows:

Source: 3Blue1Brown’s video on Vectors

Carrying the idea forward to 3 dimensions, we get an arrow in 3D space as follows:

Source: 3Blue1Brown’s video on Vectors

Dot Product Of Two Vectors

At the cost of making this tutorial even more boring than it already is, let's look at what a dot product is. Imagine you have two vectors of size n+1, w and x; the dot product of these vectors (w.x) can be computed as follows:
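With the components indexed from 0 to n, that is:

$$ w \cdot x = w^{T} x = \sum_{i=0}^{n} w_i x_i = w_0 x_0 + w_1 x_1 + \dots + w_n x_n $$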

The transpose is just to write it in a matrix multiplication form.

Here, w and x are just two lonely arrows in an n+1 dimensional space (and intuitively, their dot product quantifies how much one vector is going in the direction of the other). So technically, the perceptron was only computing a lame dot product (before checking whether it's greater than or less than 0). The decision boundary the perceptron gives out, the line that separates the positive examples from the negative ones, is really just w.x = 0.

Angle Between Two Vectors

Now the same old dot product can be computed differently if only you knew the angle between the vectors and their individual magnitudes. Here’s how:
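With alpha denoting the angle between the two vectors:

$$ w \cdot x = \lVert w \rVert \, \lVert x \rVert \, \cos\alpha $$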

The other way around, you can get the angle between two vectors from the vectors themselves, given that you know how to calculate their magnitudes and their vanilla dot product:
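Rearranging the identity above:

$$ \alpha = \cos^{-1}\!\left( \frac{w \cdot x}{\lVert w \rVert \, \lVert x \rVert} \right) $$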

When I say that the cosine of the angle between w and x is 0, what do you see? I see arrow w being perpendicular to arrow x in an n+1 dimensional space (though I picture it in 2-dimensional space, to be honest). So basically, when the dot product of two vectors is 0, they are perpendicular to each other.

Setting Up The Problem

We are going to use a perceptron to estimate whether I will watch a movie, based on historical data with the above-mentioned inputs. The data has positive and negative examples, positive being the movies I watched (label 1) and negative being the ones I didn't (label 0). Based on the data, we are going to learn the weights using the perceptron learning algorithm. For visual simplicity, we will assume only two-dimensional inputs.
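To make this concrete, here is a hypothetical version of such a dataset; the two features and all the numbers are made up purely for illustration, and each input gets a constant x_0 = 1 prepended so the bias is learned along with the other weights:

import numpy as np

# Hypothetical two-dimensional inputs, e.g. (critic rating, IMDb rating), both in [0, 1].
# All values are invented for illustration.
P = [np.array([1, 0.9, 0.8]), np.array([1, 0.7, 0.9])]   # movies I watched (positive examples)
N = [np.array([1, 0.2, 0.3]), np.array([1, 0.4, 0.1])]   # movies I didn't watch (negative examples)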

Perceptron Learning Algorithm

Our goal is to find the w vector that can perfectly classify positive inputs and negative inputs in our data. I will get straight to the algorithm. Here goes:

We initialize w with some random vector. We then iterate over all the examples in the data (P ∪ N), both positive and negative. Now if an input x belongs to P, ideally what should the dot product w.x be? I'd say greater than or equal to 0, because that's the only thing our perceptron wants at the end of the day, so let's give it that. And if x belongs to N, the dot product MUST be less than 0. So if you look at the if conditions in the while loop:

Case 1: When x belongs to P and its dot product w.x < 0
Case 2: When x belongs to N and its dot product w.x ≥ 0

Only in these cases do we update our randomly initialized w, because Case 1 and Case 2 violate the very rule of the perceptron; otherwise, we don't touch w at all. So we are adding x to w (ahem, vector addition, ahem) in Case 1 and subtracting x from w in Case 2.
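Here is a minimal Python sketch of the algorithm as described above; the function name and the numpy representation are my own choices, and each example is assumed to already carry x_0 = 1 for the bias:

import random
import numpy as np

def perceptron_learning(P, N, max_epochs=1000):
    # P: positive examples, N: negative examples; each example is a numpy
    # array with the constant x_0 = 1 prepended.
    data = [(x, 1) for x in P] + [(x, 0) for x in N]
    w = np.random.randn(len(data[0][0]))         # initialize w with some random vector
    for _ in range(max_epochs):
        random.shuffle(data)
        updated = False
        for x, label in data:
            if label == 1 and np.dot(w, x) < 0:      # Case 1: x belongs to P but w.x < 0
                w = w + x                            # add x to w
                updated = True
            elif label == 0 and np.dot(w, x) >= 0:   # Case 2: x belongs to N but w.x >= 0
                w = w - x                            # subtract x from w
                updated = True
        if not updated:                              # no example violates the perceptron rule
            break
    return w

With the hypothetical movie data from earlier, w = perceptron_learning(P, N) should return a w with w.x ≥ 0 for every x in P and w.x < 0 for every x in N.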

Why Would The Specified Update Rule Work?

But why would this work? If you already get why this would work, you've got the entire gist of my post and you can now move on with your life, thanks for reading, bye. But if you are not sure why these seemingly arbitrary operations on x and w would help you learn that perfect w which can perfectly classify P and N, stick with me.

We have already established that when x belongs to P, we want w.x > 0, basic perceptron rule. What we also mean by that is that when x belongs to P, the angle between w and x should be _____ than 90 degrees. Fill in the blank.

Answer: The angle between w and x should be less than 90 degrees, because the cosine of the angle is proportional to the dot product; a positive dot product means the cosine is positive, which means the angle is less than 90 degrees.

So whatever the w vector may be, as long as it makes an angle less than 90 degrees with the positive example data vectors (x ∈ P) and an angle more than 90 degrees with the negative example data vectors (x ∈ N), we are cool. So ideally, it should look something like this:

x_0 is always 1 so we ignore it for now.

So we now strongly believe that the angle between w and x should be less than 90 degrees when x belongs to the P class, and the angle between them should be more than 90 degrees when x belongs to the N class. Pause and convince yourself that the above statements are true and that you indeed believe them. Here's why the update works:
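Here is a rough way to write that out for Case 1, where the update replaces w with w_new = w + x (treating the magnitudes in the denominator of the cosine as fixed):

$$ \cos\alpha \propto w \cdot x $$
$$ \cos\alpha_{\text{new}} \propto w_{\text{new}} \cdot x = (w + x) \cdot x = w \cdot x + x \cdot x \;\ge\; w \cdot x $$

So cos(alpha_new) ≥ cos(alpha), which means the angle between w and x does not increase. For Case 2, w_new = w - x gives w_new · x = w · x - x · x ≤ w · x, pushing the angle the other way.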

Now this is slightly inaccurate but it is okay to get the intuition.

So when we add x to w, which we do when x belongs to P and w.x < 0 (Case 1), we are essentially increasing the cos(alpha) value, which means we are decreasing alpha, the angle between w and x, which is what we want. And a similar intuition works for the case when x belongs to N and w.x ≥ 0 (Case 2).

Here’s a toy simulation of how we might end up learning a w that makes an angle of less than 90 degrees with the positive examples and more than 90 degrees with the negative examples.

We start with a random vector w.

Proof Of Convergence

Now, there is no reason for you to believe that this will definitely converge for all kinds of datasets. It seems like there might be a case where w keeps on moving around and never converges. But it has been proved that this algorithm converges, provided the positive and negative examples are linearly separable. I am attaching the proof, by Prof. Michael Collins of Columbia University — find the paper here.
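Roughly, the statement proved there (the usual form of the perceptron convergence theorem; the notation below is the standard one, not something specific to this post) is: if there is a unit vector $w^{*}$ and a margin $\gamma > 0$ such that $w^{*} \cdot x \ge \gamma$ for every $x \in P$ and $w^{*} \cdot x \le -\gamma$ for every $x \in N$, and every example satisfies $\lVert x \rVert \le R$, then the algorithm makes at most

$$ \frac{R^{2}}{\gamma^{2}} $$

updates before it stops.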

Conclusion

In this post, we quickly looked at what a perceptron is. We then warmed up with a few basics of linear algebra. We then looked at the Perceptron Learning Algorithm and then went on to visualize why it works i.e., how the appropriate weights are learned.

Thank you for reading this post.
Live and let live!
A

