Overview and implementation in Python

Perceptrons are one of the earliest computational models of neural networks (NNs), and they form the basis for the more complex and deeper networks we have today. Understanding the perceptron model and its theory gives you a good foundation for many of the key concepts in neural networks in general.
Background: Biological Neural Networks
A biological neural network (such as the one we have in our brain) is composed of a large number of nerve cells called neurons.
Each neuron receives electrical signals (impulses) from its neighboring neurons via fibers called dendrites. When the total sum of its incoming signals exceeds some threshold, the neuron "fires" its own signal via long fibers called axons that are connected to the dendrites of other neurons.
The junction between two neurons is called a synapse. On average, each neuron has about 7,000 synaptic connections, which demonstrates the high connectivity of the network in our brain. When we learn new associations between two concepts, the synaptic connection between the neurons that represent these concepts is strengthened. This phenomenon is known as Hebb’s rule (1949), often summarized as "Cells that fire together wire together".

The Perceptron Model
The Perceptron model, introduced by Frank Rosenblatt in 1957, is a simplified model of a biological neuron.
The perceptron has m binary inputs denoted by x₁, …, xₘ, which represent the incoming signals from its neighboring neurons, and it outputs a single binary value denoted by o that indicates whether the perceptron is "firing" or not.

Each input neuron xᵢ is connected to the perceptron via a link whose strength is represented by a weight wᵢ. Inputs with higher weights have a larger influence on the perceptron’s output.
The perceptron first computes the weighted sum of its incoming signals by multiplying each input by its corresponding weight. This weighted sum is often called the net input and is denoted by z:
$$z = \sum_{i=1}^{m} w_i x_i$$
If the net input reaches some predefined threshold value θ, then the perceptron fires (its output is 1); otherwise it doesn’t fire (its output is 0). In other words, the perceptron fires if and only if:
$$\sum_{i=1}^{m} w_i x_i \geq \theta$$
Our goal is to find the weights w₁, …, wₘ and the threshold θ such that the perceptron correctly maps its inputs x₁, …, xₘ (representing the features in our data) to the desired output y (representing the label).
To simplify the learning process, instead of learning the weights and the threshold separately, we add a special input neuron called the bias neuron that always outputs the value 1. This neuron is typically denoted by x₀ and its connection weight is denoted by b or w₀. The bias takes over the role of the threshold, with b = −θ.
As a result, the net input of the perceptron becomes:
$$z = \sum_{i=0}^{m} w_i x_i = \sum_{i=1}^{m} w_i x_i + b$$
This formulation allows us to learn the correct threshold (bias) as if it were one of the weights of the incoming signals.
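The equivalence between the threshold form and the bias form can be checked numerically. The following is only a minimal sketch: the weights and the threshold are made-up values, the function names `fires_with_threshold` and `fires_with_bias` are hypothetical, and it assumes the convention b = −θ together with the "fires when the net input is non-negative" rule.

```python
import itertools
import numpy as np

# Illustrative values only (assumptions, not taken from the article)
w = np.array([0.5, 0.25, 0.75])   # weights w1..w3
theta = 1.0                        # firing threshold

def fires_with_threshold(x):
    """Fire when the weighted sum reaches the threshold theta."""
    return int(w @ x >= theta)

def fires_with_bias(x):
    """Same rule with a bias input x0 = 1 whose weight is w0 = b = -theta."""
    x_ext = np.concatenate(([1.0], x))      # prepend the bias input x0 = 1
    w_ext = np.concatenate(([-theta], w))   # prepend the bias weight w0 = b
    return int(w_ext @ x_ext >= 0)

# Compare the two formulations on all 8 binary input vectors
inputs = [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=3)]
same = all(fires_with_threshold(x) == fires_with_bias(x) for x in inputs)
print(same)
```

Both functions agree on every input, which is why the threshold can be learned as just another weight.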
In vector form, we can write z as the dot product between the input vector x = (x₁, …, xₘ)ᵗ and the weight vector w = (w₁, …, wₘ)ᵗ plus the bias:
$$z = \mathbf{w}^T \mathbf{x} + b$$
The perceptron fires if and only if the net input is non-negative, i.e.,
$$\mathbf{w}^T \mathbf{x} + b \geq 0$$
More generally, the perceptron applies an activation function f(z) on the net input that generates its output. The two most common activation functions used in perceptrons are:
- The step function (also known as the Heaviside function), whose value is 0 for negative inputs and 1 for non-negative inputs:
$$\text{step}(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}$$
- The sign function, whose value is -1 for negative inputs and 1 for non-negative inputs:
$$\text{sign}(z) = \begin{cases} 1 & z \geq 0 \\ -1 & z < 0 \end{cases}$$
Other types of activation functions are used in more complex networks, such as multi-layer perceptrons (MLPs). For the rest of this article, I will assume that the perceptron is using the step function.
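As a small illustration, the two activation functions described above can be sketched in NumPy as follows. The helper names `step` and `sign` are my own. Note that NumPy's built-in `np.sign` returns 0 at z = 0, so it does not match the definition used here, which is why the sketch relies on `np.where` instead.

```python
import numpy as np

def step(z):
    """Step function: 0 for z < 0, 1 for z >= 0."""
    return np.where(z >= 0, 1, 0)

def sign(z):
    """Sign function as defined in this article: -1 for z < 0, +1 for z >= 0."""
    return np.where(z >= 0, 1, -1)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z).tolist())  # [0, 0, 1, 1, 1]
print(sign(z).tolist())  # [-1, -1, 1, 1, 1]
```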
To summarize, the computation of the perceptron consists of two steps:
- Multiplication of the input values x₁, …, xₘ by their corresponding weights w₁, …, wₘ, and adding the bias b, which gives us the net input of the perceptron z = wᵗx + b.
- Applying an activation function f(z) on the net input that generates a binary output (0/1 or -1/+1).
We can write this entire computation in one equation:
$$o = f(\mathbf{w}^T \mathbf{x} + b)$$
where f is the chosen activation function and o is the output of the perceptron.
Implementing Logic Gates with Perceptrons
To demonstrate how perceptrons work, let’s try to build perceptrons that compute the logical functions AND and OR.
As a reminder, the logical AND function has two binary inputs and returns true (1) if both of its inputs are true, otherwise it returns false (0).

A perceptron that implements the AND function has two binary inputs and a bias. We want this perceptron to "fire" only when both of its inputs are "firing". This can be achieved, for example, by choosing the same weight for both inputs, e.g., w₁ = w₂ = 1, and then choosing the bias to be within the range [-2, -1). This makes sure that when both neurons are firing, the net input 2 + b is non-negative, but when only one of them is firing, the net input 1 + b is negative (and when neither of them is firing, the net input b is also negative).
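The bias range can be checked by brute force over the four possible inputs. The sketch below uses the weights from the text (w₁ = w₂ = 1); the helper names `perceptron_output` and `computes_and`, and the specific biases tried, are my own choices for illustration.

```python
import itertools
import numpy as np

INPUTS = list(itertools.product([0, 1], repeat=2))   # (0,0), (0,1), (1,0), (1,1)

def perceptron_output(x, w, b):
    """Step-function perceptron: outputs 1 iff w.x + b >= 0."""
    return int(np.dot(w, x) + b >= 0)

def computes_and(w, b):
    """Check the perceptron against the AND truth table on all four inputs."""
    return all(perceptron_output(np.array(x), w, b) == (x[0] & x[1]) for x in INPUTS)

w = np.array([1.0, 1.0])   # w1 = w2 = 1, as in the text

# Three biases inside [-2, -1), then two just outside the range
for b in [-2.0, -1.5, -1.25, -1.0, -2.25]:
    print(f"b = {b:6.2f}: computes AND = {computes_and(w, b)}")
```

The first three biases give a working AND gate. With b = -1 a single active input already fires the perceptron, and with b = -2.25 even two active inputs are not enough.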

In a similar fashion, we can build a perceptron that computes the logical OR function:

Verify that you understand how this perceptron works!
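One way to verify an OR perceptron is to evaluate it on all four inputs at once. The parameters below are an assumption on my part (equal weights of 1 and a bias of -0.5, i.e., any bias in [-1, 0) would do) and may differ from the values in the original figure.

```python
import numpy as np

# All four input combinations, one per row
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Assumed parameters: equal weights and a bias in [-1, 0),
# so that a single active input is enough to fire
w = np.array([1.0, 1.0])
b = -0.5

z = X @ w + b                  # net inputs for all rows at once
o = (z >= 0).astype(int)       # step function

for x, out in zip(X, o):
    print(x.tolist(), "->", out)
```

Only the input (0, 0) leaves the net input negative (z = -0.5), so the perceptron outputs 0 for it and 1 for the other three inputs.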
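As an exercise, try to build a perceptron for the NAND function, whose truth table is shown below:

| x₁ | x₂ | x₁ NAND x₂ |
|----|----|------------|
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |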
Perceptrons as Linear Classifiers
The perceptron is a type of linear classifier, since it divides the input space into two areas separated by the following hyperplane:
$$\mathbf{w}^T \mathbf{x} + b = 0$$
The weight vector w is orthogonal to this hyperplane, and thus determines its orientation, while the bias b defines its distance from the origin.
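Every example above the hyperplane (wᵗx + b > 0) is classified by the perceptron as a positive example, while every example below the hyperplane (wᵗx + b < 0) is classified as a negative example. Examples that lie exactly on the hyperplane (wᵗx + b = 0) are classified as positive when the step function is used.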
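The geometric claims above can be illustrated with a small numeric example. The weights, bias, and points below are made-up values chosen so that the arithmetic is easy to follow; the distance of the hyperplane from the origin is computed with the standard formula |b| / ‖w‖.

```python
import numpy as np

# Illustrative 2-D parameters (assumptions, not from the article)
w = np.array([3.0, 4.0])
b = -10.0

# Two points on the hyperplane w.x + b = 0
p = np.array([2.0, 1.0])     # 3*2 + 4*1 - 10 = 0
q = np.array([-2.0, 4.0])    # 3*(-2) + 4*4 - 10 = 0

direction = q - p                       # a vector lying inside the hyperplane
orthogonality = w @ direction           # 0.0, so w is orthogonal to the hyperplane
distance = abs(b) / np.linalg.norm(w)   # distance of the hyperplane from the origin

def classify(x):
    """Positive side (1) when w.x + b >= 0, negative side (0) otherwise."""
    return int(w @ x + b >= 0)

print(orthogonality)                       # 0.0
print(distance)                            # 2.0
print(classify(np.array([3.0, 3.0])))      # 1
print(classify(np.array([0.0, 0.0])))      # 0
```

Scaling w and b by the same positive factor leaves the hyperplane and every classification unchanged, which is why only the direction of w and the ratio b / ‖w‖ matter geometrically.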

Other linear classifiers include logistic regression and linear SVMs (support vector machines).
Linear classifiers are capable of learning only linearly separable problems, i.e., problems where the decision boundary between the positive and the negative examples is a linear surface (a hyperplane).
For example, the following data set is not linearly separable, therefore a perceptron cannot classify correctly all the examples in this data set:

The Perceptron Learning Rule
The perceptron has a simple learning rule that is guaranteed to find a separating hyperplane if the data is linearly separable.
For each training sample (xᵢ, yᵢ) that is misclassified by the perceptron (i.e., oᵢ ≠ yᵢ), we apply the following update rule to every weight wⱼ (with j = 0 standing for the bias, whose input is always 1):
$$w_j \leftarrow w_j + \alpha\,(y_i - o_i)\,x_{ij}, \qquad j = 0, 1, \ldots, m$$
where α is a learning rate (0 < α ≤ 1) that controls the size of the weight adjustment in each update.
In other words, we add to each connection weight wⱼ the error of the perceptron on this example (the difference between the true label yᵢ and the output oᵢ) multiplied by the value of the corresponding input xᵢⱼ and the learning rate.
What this learning rule tries to do is to reduce the discrepancy between the perceptron’s output oᵢ and the true label yᵢ. To understand why it works, let’s examine the two possible cases of a misclassification by the perceptron:
- The true label is yᵢ = 1, but the perceptron’s prediction is oᵢ = 0, i.e., wᵗxᵢ + b < 0. In this case, we would like to increase the perceptron’s net input so that it eventually becomes non-negative. To that end, we add the quantity (yᵢ – oᵢ)xᵢ = xᵢ to the weight vector (multiplied by the learning rate). This increases the weights of the inputs with positive values (where xᵢⱼ > 0), while decreasing the weights of the inputs with negative values (where xᵢⱼ < 0). Consequently, the overall net input of the perceptron increases.
- The true label is yᵢ = 0, but the perceptron’s prediction is oᵢ = 1, i.e., wᵗxᵢ + b ≥ 0. Analogously to the previous case, here we would like to decrease the perceptron’s net input so that it eventually becomes negative. This is achieved by adding the quantity (yᵢ – oᵢ)xᵢ = –xᵢ to the weight vector, since this decreases the weights of the inputs with positive values (where xᵢⱼ > 0) while increasing the weights of the inputs with negative values (where xᵢⱼ < 0). Consequently, the overall net input of the perceptron decreases.
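The two cases above can be made quantitative: substituting the updated weights and bias into z = wᵗx + b shows that a single update changes the net input of that sample by exactly α(y – o)(‖x‖² + 1), which is positive in the first case and negative in the second. The sketch below checks this numerically. All values are made up for illustration, the function names are hypothetical, and o is fixed by hand to represent each case (the identity holds regardless of the current weights).

```python
import numpy as np

def net_input(w, b, x):
    """Net input z = w.x + b."""
    return w @ x + b

def apply_update(w, b, x, y, o, alpha):
    """One application of the perceptron learning rule to (w, b)."""
    error = y - o
    return w + alpha * error * x, b + alpha * error

def net_input_change(w, b, x, y, o, alpha):
    """Change in the net input of sample x caused by one update."""
    w_new, b_new = apply_update(w, b, x, y, o, alpha)
    return net_input(w_new, b_new, x) - net_input(w, b, x)

# Made-up values for illustration; x mixes positive and negative entries
w = np.array([0.2, -0.4, 0.1])
b = 0.0
x = np.array([1.0, -2.0, 0.5])
alpha = 0.1

for y, o in [(1, 0), (0, 1)]:   # the two misclassification cases
    change = net_input_change(w, b, x, y, o, alpha)
    predicted = alpha * (y - o) * (x @ x + 1)
    print(f"y = {y}, o = {o}: net input changes by {change:+.3f} "
          f"(alpha * (y - o) * (||x||^2 + 1) = {predicted:+.3f})")
```

Because the change is proportional to α, a smaller learning rate moves the net input in smaller steps, so more updates may be needed before the sample is classified correctly.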
This learning rule is applied to all the training samples sequentially (in an arbitrary order). It typically requires more than one pass over the entire training set (each pass is called an epoch) to find a correct weight vector (i.e., the vector of a hyperplane that separates the positive examples from the negative ones).
According to the perceptron convergence theorem, if the data is linearly separable, applying the perceptron learning rule repeatedly will converge to the weights of a separating hyperplane in a finite number of steps. The interested reader can find a formal proof of this theorem in this paper.
The Perceptron Learning Algorithm
In practice, it may take a long time for the perceptron learning process to converge (i.e., reach zero errors on the training set). Furthermore, the data itself may not be linearly separable, in which case the algorithm may never terminate. Therefore, we need to limit the number of training epochs by some predefined parameter. If the perceptron achieves zero errors on the training set before this number is reached, we can stop its training earlier.
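The perceptron learning algorithm is summarized in the following pseudocode:

    initialize the weights w and the bias b
    repeat for at most the maximum number of epochs:
        n_errors ← 0
        for each training sample (xᵢ, yᵢ):
            oᵢ ← f(wᵗxᵢ + b)
            if oᵢ ≠ yᵢ:
                w ← w + α(yᵢ – oᵢ)xᵢ
                b ← b + α(yᵢ – oᵢ)
                n_errors ← n_errors + 1
        if n_errors = 0: stop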
Note that the weights are typically initialized to small random values, while the bias is initialized to zero. For a single perceptron this is a convention rather than a requirement: the learning rule also works when all the weights start at zero, as in the example below.
Example: Learning the Majority Function
Let’s see how the perceptron learning algorithm can be used to learn the majority function of three binary inputs. The majority function evaluates to true (1) when at least half of its inputs are true, which for three inputs means at least two of them, and to false (0) otherwise.
The training set of the perceptron includes all the 8 possible binary inputs:
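For reference, the training set can be generated programmatically. This is a minimal sketch that assumes the samples are listed in binary counting order, from (0, 0, 0) to (1, 1, 1); the order used in the original table may differ.

```python
import itertools
import numpy as np

# All 8 binary input vectors, in binary counting order (an assumption)
X = np.array(list(itertools.product([0, 1], repeat=3)))

# Majority label: 1 when at least two of the three inputs are 1
y = (X.sum(axis=1) >= 2).astype(int)

for xi, yi in zip(X, y):
    print(xi.tolist(), "->", yi)
```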

In this example we will assume that the initial weights and bias are 0, and the learning rate is α = 0.5.
Let’s track the weight updates during the first epoch of training. The first sample presented to the perceptron is x = (0, 0, 0)ᵗ. The net input of the perceptron in this case is: z = wᵗx + b = 0 × 0 + 0 × 0 + 0 × 0 + 0 = 0. Therefore, its output is o = 1 (remember that the step function outputs 1 whenever its input is ≥ 0). However, the target label in this case is y = 0, so the error made by the perceptron is y – o = -1.
According to the perceptron learning rule, we update each weight wᵢ by adding to it α(y – o)xᵢ = -0.5xᵢ. Since all the inputs are 0 in this case, except for the bias neuron (x₀ = 1), we only update the bias to be -0.5 instead of 0.
We repeat the same process for all the other 7 training samples. The following table shows the weight updates after every sample:

During the first epoch the perceptron made four errors. The weight vector after the first epoch is w = (0, 0.5, 1)ᵗ and the bias is 0.
In the second training epoch, we get the following weight updates:

This time the perceptron has made only three errors. The weight vector after the second epoch is w = (0.5, 0.5, 1)ᵗ and the bias is -0.5.
The weight updates in the third epoch are:

After the update of the second example in this epoch, the perceptron has converged to the weight vector that solves this classification problem: w = (0.5, 0.5, 0.5)ᵗ and b = -1. Since all the weights are equal, the perceptron fires only when at least two of the inputs are 1, in which case their weighted sum is at least 1, i.e., greater than or equal to the absolute value of the bias, and hence the net input of the perceptron is non-negative.
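The whole trace can be reproduced with a few lines of NumPy. The sketch below is not the article's implementation (that comes in the next section); it is a stripped-down loop that starts from zero weights and bias with α = 0.5, and it assumes the samples are presented in binary counting order. The function name `train_perceptron` and its `history` return value are my own.

```python
import itertools
import numpy as np

def train_perceptron(X, y, alpha=0.5, max_epochs=10):
    """Run the perceptron learning rule from zero weights and zero bias.

    Returns the final weights and bias plus a per-epoch history of
    (number of errors, weights, bias).
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    history = []
    for _ in range(max_epochs):
        n_errors = 0
        for xi, yi in zip(X, y):
            o = 1 if w @ xi + b >= 0 else 0      # step function
            if o != yi:
                w = w + alpha * (yi - o) * xi    # update the weights
                b = b + alpha * (yi - o)         # update the bias
                n_errors += 1
        history.append((n_errors, w.copy(), b))
        if n_errors == 0:                        # converged
            break
    return w, b, history

# Samples presented in binary counting order (an assumption about the table)
X = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
y = (X.sum(axis=1) >= 2).astype(int)

w, b, history = train_perceptron(X, y)
for epoch, (n_errors, w_e, b_e) in enumerate(history, start=1):
    print(f"Epoch {epoch}: errors = {n_errors}, w = {w_e.tolist()}, b = {float(b_e)}")
```

Under these assumptions the loop makes 4, 3, and 1 errors in the first three epochs and ends with w = (0.5, 0.5, 0.5) and b = -1, matching the values reported above. A fourth, error-free epoch is what confirms convergence.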
Perceptron Implementation in Python
Let’s now implement the perceptron learning algorithm in Python.
We will implement it as a custom Scikit-Learn estimator by extending the sklearn.base.BaseEstimator class. This will allow us to use it like any other estimator in Scikit-Learn (e.g., adding it to a pipeline).
A custom estimator needs to implement the fit() and predict() methods, and set all its hyperparameters in the __init__() method.
I will first show the complete code of this class, and then walk through it step-by-step.
import numpy as np
from sklearn.base import BaseEstimator

class Perceptron(BaseEstimator):
    def __init__(self, alpha, n_epochs):
        self.alpha = alpha        # the learning rate
        self.n_epochs = n_epochs  # number of training iterations

    def fit(self, X, y):
        (n, m) = X.shape  # n is the number of samples, m is the number of features

        # Initialize the weights to random values and the bias to 0
        self.w = np.random.randn(m)
        self.b = 0

        # The training loop
        for epoch in range(self.n_epochs):
            n_errors = 0  # number of misclassification errors

            for i in range(n):
                o = self.predict(X[i])
                if o != y[i]:
                    # Apply the perceptron learning rule
                    self.w += self.alpha * (y[i] - o) * X[i]
                    self.b += self.alpha * (y[i] - o)
                    n_errors += 1

            # Compute the accuracy on the training set
            accuracy = 1 - (n_errors / n)
            print(f'Epoch {epoch + 1}: accuracy = {accuracy:.3f}')

            # Stop the training when there are no more errors
            if n_errors == 0:
                break

    def predict(self, X):
        z = X @ self.w + self.b
        return np.heaviside(z, 1)  # the step function
The constructor of the class initializes the two hyperparameters of the model: the learning rate (alpha) and the number of training epochs (n_epochs).
def __init__(self, alpha, n_epochs):
    self.alpha = alpha
    self.n_epochs = n_epochs
The fit() method runs the learning algorithm on a given data set X with labels y. We first find out how many samples and features we have in the data set by interrogating the shape of X:
(n, m) = X.shape
n is the number of training samples and m is the number of features.
Next, we initialize the weight vector using the standard normal distribution (with mean 0 and standard deviation of 1), and the bias to 0:
self.w = np.random.randn(m)
self.b = 0
We now run the training loop for n_epochs iterations. In each iteration, we go over all the training samples, and for each sample we check if the perceptron classifies it correctly by calling the predict() method and comparing its output to the true label:
for i in range(n):
    o = self.predict(X[i])
    if o != y[i]:
If the perceptron has misclassified the sample, we apply the weight update rule to both the weight vector and the bias, and then increment the number of misclassification errors by 1:
self.w += self.alpha * (y[i] - o) * X[i]
self.b += self.alpha * (y[i] - o)
n_errors += 1
When the epoch terminates, we report the perceptron’s current accuracy on the training set, and if the number of errors was 0, we terminate the training loop:
accuracy = 1 - (n_errors / n)
print(f'Epoch {epoch + 1}: accuracy = {accuracy:.3f}')
if n_errors == 0:
    break
The predict() method is quite straightforward. We first compute the net input of the perceptron as the dot product between the input vector and the weights plus the bias:
z = X @ self.w + self.b
Finally, we use NumPy’s heaviside() function to apply the step function to the net input and return the output:
return np.heaviside(z, 1)
The second parameter of np.heaviside() specifies what should be the value of the function for z = 0.
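A quick check makes the role of this second argument concrete. The snippet below only calls `np.heaviside` with a few different values for it; note also that the function returns floats, so the perceptron's predictions are 0.0 and 1.0 rather than integers.

```python
import numpy as np

z = np.array([-1.0, 0.0, 1.0])

fires_at_zero = np.heaviside(z, 1)     # value at z = 0 is 1 (used in this article)
silent_at_zero = np.heaviside(z, 0)    # value at z = 0 is 0
half_at_zero = np.heaviside(z, 0.5)    # value at z = 0 is 0.5

print(fires_at_zero.tolist())   # [0.0, 1.0, 1.0]
print(silent_at_zero.tolist())  # [0.0, 0.0, 1.0]
print(half_at_zero.tolist())    # [0.0, 0.5, 1.0]
```

Passing 1 matches the convention used throughout this article, where the perceptron fires whenever the net input is non-negative.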
Let’s now test our implementation on a data set generated by the make_blobs() function from Scikit-Learn.
We first generate a data set with 100 random points divided into two groups:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=0.5)
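We set cluster_std to 0.5 (instead of the default 1) to keep the two clusters tight, which makes it very likely that the data is linearly separable. Since no random_state is passed, every run generates a different data set.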
Let’s plot the data set:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, style=y, markers=('s', 'o'),
                palette=('r', 'b'), edgecolor='black')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')

We now instantiate the Perceptron class and fit it to the data set:
perceptron = Perceptron(alpha=0.01, n_epochs=10)
perceptron.fit(X, y)
The output during training is:
Epoch 1: accuracy = 0.250
Epoch 2: accuracy = 0.950
Epoch 3: accuracy = 1.000
The perceptron has converged after three epochs of training.
We can plot the decision boundary found by the perceptron and the two class areas using the following function:
def plot_decision_boundary(model, X, y):
    # Retrieve the model parameters
    w1, w2, b = model.w[0], model.w[1], model.b

    # Calculate the slope and intercept of the separating line
    # w1*x1 + w2*x2 + b = 0 (this assumes w2 != 0)
    slope = -w1 / w2
    intercept = -b / w2

    # Compute the end points of the line over the plotted range
    x1 = X[:, 0]
    x2 = X[:, 1]
    x1_min, x1_max = x1.min() - 0.2, x1.max() + 0.2
    x2_min, x2_max = x2.min() - 0.5, x2.max() + 0.5
    x1_d = np.array([x1_min, x1_max])
    x2_d = slope * x1_d + intercept

    # Plot the line
    plt.plot(x1_d, x2_d, 'k', ls='--')

    # Fill the two classification areas with two different colors.
    # The positive class (label 1, blue) lies above the line when w2 > 0
    # and below it when w2 < 0.
    below, above = ('red', 'blue') if w2 > 0 else ('blue', 'red')
    plt.fill_between(x1_d, x2_d, x2_min, color=below, alpha=0.25)
    plt.fill_between(x1_d, x2_d, x2_max, color=above, alpha=0.25)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)

    # Draw the data points
    sns.scatterplot(x=x1, y=x2, hue=y, style=y, markers=('s', 'o'),
                    palette=('r', 'b'), edgecolor='black')
    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')

plot_decision_boundary(perceptron, X, y)

Scikit-Learn provides its own Perceptron class that implements a similar algorithm, but provides more options such as regularization and early stopping.
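For comparison, here is a minimal sketch of using Scikit-Learn's built-in class on a similar data set. The hyperparameter values (eta0, max_iter, tol, random_state) are arbitrary choices for illustration rather than recommendations, and the class is imported under the alias `SkPerceptron` only to avoid a clash with the custom class defined above.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron as SkPerceptron

# A reproducible two-cluster data set (random_state fixed for illustration)
X, y = make_blobs(n_samples=100, n_features=2, centers=2,
                  cluster_std=0.5, random_state=0)

# eta0 is the learning rate; tol=None disables the tolerance-based stopping
# criterion, so training runs for max_iter epochs
clf = SkPerceptron(eta0=0.01, max_iter=100, tol=None, random_state=0)
clf.fit(X, y)

print(f"Training accuracy: {clf.score(X, y):.3f}")
print("Weights:", clf.coef_, "Bias:", clf.intercept_)
```

The learned parameters are exposed as `coef_` and `intercept_`, which play the roles of w and b in the custom implementation.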
Limitations of the Perceptron Model
Although the perceptron model showed some initial success, it was quickly realized that perceptrons cannot learn some simple functions such as the XOR function:

The XOR problem is not linearly separable, therefore linear models such as perceptrons cannot solve it.
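A brute-force search makes this limitation tangible. The sketch below tries every combination of two weights and a bias on a coarse grid and counts how many of them reproduce each truth table. The grid and the helper name `count_solutions` are my own choices, and a finite search like this is an illustration rather than a proof.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "OR":  np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

def count_solutions(y, grid):
    """Count (w1, w2, b) triples on the grid whose perceptron reproduces y."""
    count = 0
    for w1, w2, b in itertools.product(grid, repeat=3):
        o = (X @ np.array([w1, w2]) + b >= 0).astype(int)   # step function
        count += int(np.array_equal(o, y))
    return count

grid = np.arange(-2, 2.25, 0.25)   # 17 candidate values per parameter
for name, y in targets.items():
    print(f"{name}: {count_solutions(y, grid)} solutions on the grid")
```

AND and OR each have many solutions on the grid, while XOR has none. The underlying reason is algebraic: XOR would require b < 0, w₁ + b ≥ 0, w₂ + b ≥ 0, and w₁ + w₂ + b < 0 at the same time, but adding the two middle inequalities gives w₁ + w₂ + b ≥ -b > 0, which contradicts the last one.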
This revelation caused the field of neural networks to stagnate for many years (a period known as "the AI winter"), until it was realized that stacking multiple perceptrons in layers can solve more complex and non-linear problems such as the XOR problem.
Multi-layer perceptrons (MLPs) are covered in this article.
Final Notes
All images unless otherwise noted are by the author.
You can find the code samples of this article on my github: https://github.com/roiyeho/medium/tree/main/perceptrons
Thanks for reading!