
Perceptrons: The First Neural Network Model

Overview and implementation in Python

Photo by Hal Gatewood on Unsplash

Perceptrons are one of the earliest computational models of Neural Networks (NNs), and they form the basis for the more complex and deep networks we have today. Understanding the perceptron model and its theory will provide you with a good basis for understanding many of the key concepts in neural networks in general.

Background: Biological Neural Networks

A biological neural network (such as the one we have in our brain) is composed of a large number of nerve cells called neurons.

Each neuron receives electrical signals (impulses) from its neighboring neurons via fibers called dendrites. When the total sum of its incoming signals exceeds some threshold, the neuron "fires" its own signal via long fibers called axons that are connected to the dendrites of other neurons.

The junction between two neurons is called a synapse. On average, each neuron forms about 7,000 synaptic connections, which demonstrates the high connectivity of the network in our brain. When we learn a new association between two concepts, the synaptic connection between the neurons that represent these concepts is strengthened. This phenomenon is known as Hebb’s rule (1949), which states that "cells that fire together wire together".

Biological neuron (public image freely licensed under Wikimedia Commons)

The Perceptron Model

The Perceptron model, introduced by Frank Rosenblatt in 1957, is a simplified model of a biological neuron.

The perceptron has m binary inputs denoted by x₁, …, xₘ, which represent the incoming signals from its neighboring neurons, and it outputs a single binary value denoted by o that indicates if the perceptron is "firing" or not.

The perceptron model

Each input neuron xᵢ is connected to the perceptron via a link whose strength is represented by a weight wᵢ. Inputs with higher weights have a larger influence on the perceptron’s output.

The perceptron first computes the weighted sum of its incoming signals, by multiplying each input by its corresponding weight. This weighted sum is often called net input and denoted by z:

The net input of the perceptron

If the net input reaches some predefined threshold value θ, then the perceptron fires (its output is 1), otherwise it doesn’t fire (its output is 0). In other words, the perceptron fires if and only if w₁x₁ + ⋯ + wₘxₘ ≥ θ.

Our goal is to find the weights w₁, …, wₘ and the threshold θ, such that the perceptron correctly maps its inputs x₁, …, xₘ (representing the features in our data) to the desired output y (representing the label).

To simplify the learning process, instead of learning the weights and the threshold separately, we add a special input neuron, called the bias neuron, that always outputs the value 1. This neuron is typically denoted by x₀, and its connection weight is denoted by b or w₀.

As a result, the net input of the perceptron becomes:

The net input including the bias

This formulation allows us to learn the threshold as if it were just another connection weight: setting b = −θ turns the firing condition w₁x₁ + ⋯ + wₘxₘ ≥ θ into the condition z ≥ 0.

In vector form, we can write z as the dot product between the input vector x = (x₁, …, xₘ) and the weight vector w = (w₁, …, wₘ), plus the bias:

Vector form of the net input

And the perceptron fires if and only if the net input is non-negative, i.e., z = wᵗx + b ≥ 0.

More generally, the perceptron applies an activation function f(z) on the net input that generates its output. The two most common activation functions used in perceptrons are:

  1. The step function (also known as the Heaviside function) is a function whose value is 0 for negative inputs and 1 for non-negative inputs:
The step function
  2. The sign function is a function whose value is -1 for negative inputs and 1 for non-negative inputs:
The sign function

Other types of activation functions are used in more complex networks, such as multi-layer perceptrons (MLPs). For the rest of this article, I will assume that the perceptron is using the step function.

To summarize, the computation of the perceptron consists of two steps:

  1. Multiplication of the input values x₁, …, xₘ by their corresponding weights w₁, …, wₘ, and adding the bias b, which gives us the net input of the perceptron z = wᵗx + b.
  2. Applying an activation function f(z) on the net input that generates a binary output (0/1 or -1/+1).

We can write this entire computation in one equation: o = f(wᵗx + b),

where f is the chosen activation function and o is the output of the perceptron.
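To make this concrete, here is a minimal NumPy sketch of the two-step computation, using the step function as the activation (the input values and weights are arbitrary, illustrative choices):

import numpy as np

def step(z):
    # The step (Heaviside) function: 1 for non-negative inputs, 0 otherwise
    return np.heaviside(z, 1)

def perceptron_output(x, w, b):
    z = np.dot(w, x) + b  # step 1: compute the net input
    return step(z)        # step 2: apply the activation function

x = np.array([1, 0, 1])         # input values
w = np.array([0.5, -0.2, 0.8])  # weights
b = -1.0                        # bias
print(perceptron_output(x, w, b))  # prints 1.0, since 0.5 + 0.8 - 1.0 = 0.3 ≥ 0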

Implementing Logic Gates with Perceptrons

To demonstrate how perceptrons work, let’s try to build perceptrons that compute the logical functions AND and OR.

As a reminder, the logical AND function has two binary inputs and returns true (1) if both of its inputs are true, otherwise it returns false (0).

The truth table of the AND function

A perceptron that implements the AND function has two binary inputs and a bias. We want this perceptron to "fire" only when both of its inputs are "firing". This can be achieved, for example, by choosing the same weight for both inputs, e.g., w₁ = w₂ = 1, and then choosing the bias to be within the range [-2, -1). This ensures that when both input neurons are firing, the net input 2 + b is non-negative, but when only one of them is firing, the net input 1 + b is negative (and when neither of them is firing, the net input b is also negative).

A perceptron that computes the logical AND function

In a similar fashion, we can build a perceptron that computes the logical OR function:

A perceptron that computes the logical OR function

Verify that you understand how this perceptron works!
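A quick way to check both gates is to evaluate them on all four input combinations. The AND weights below follow the choice described above (w₁ = w₂ = 1, b = -1.5), while the OR weights (w₁ = w₂ = 1, b = -0.5) are just one possible choice:

import numpy as np
from itertools import product

def perceptron(x, w, b):
    return np.heaviside(np.dot(w, x) + b, 1)

w = np.array([1, 1])
for x1, x2 in product([0, 1], repeat=2):
    x = np.array([x1, x2])
    and_out = perceptron(x, w, b=-1.5)  # fires only when both inputs are 1
    or_out = perceptron(x, w, b=-0.5)   # fires when at least one input is 1
    print(f'{x1} AND {x2} = {and_out:.0f},  {x1} OR {x2} = {or_out:.0f}')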

As an exercise, try to build a perceptron for the NAND function, whose truth table is shown below:

The truth table of the NAND function

Perceptrons as Linear Classifiers

The perceptron is a type of linear classifier, since it divides the input space into two regions separated by the following hyperplane:

The equation of the separating hyperplane

The weight vector w is orthogonal to this hyperplane, and thus determines its orientation, while the bias b defines its distance from the origin.

Every example above the hyperplane (wᵗx + b > 0) is classified by the perceptron as a positive example, while every example below the hyperplane (wᵗx + b < 0) is classified as a negative example.

Perceptron as a linear classifier

Other linear classifiers include logistic regression and linear SVMs (support vector machines).

Linear classifiers are capable of learning only linearly separable problems, i.e., problems where the decision boundary between the positive and the negative examples is a linear surface (a hyperplane).

For example, the following data set is not linearly separable, therefore a perceptron cannot classify correctly all the examples in this data set:

Non-linearly separable data set

The Perceptron Learning Rule

The perceptron has a simple learning rule that is guaranteed to find the separating hyperplane if the data is linearly separable.

For each training sample (xᵢ, yᵢ) that is misclassified by the perceptron (i.e., oᵢ ≠ yᵢ), we apply the following update rule to the weight vector:

The perceptron learning rule

where α is a learning rate (0 < α ≤ 1) that controls the size of the weight adjustment in each update.

In other words, we add to each connection weight wⱼ the error of the perceptron on this example (the difference between the true label yᵢ and the output oᵢ) multiplied by the value of the corresponding input xᵢⱼ and the learning rate.
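Written out for a single weight (treating the bias as w₀ with a constant input x₀ = 1), the update is: wⱼ ← wⱼ + α(yᵢ – oᵢ)xᵢⱼ.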

What this learning rule tries to do is to reduce the discrepancy between the perceptron’s output oᵢ and the true label yᵢ. To understand why it works, let’s examine the two possible cases of a misclassification by the perceptron:

  1. The true label is yᵢ = 1, but the perceptron’s prediction is oᵢ = 0, i.e., wᵗxᵢ + b < 0. In this case, we would like to increase the perceptron’s net input so that it eventually becomes positive. To that end, we add the quantity (yᵢ – oᵢ)xᵢ = xᵢ to the weight vector (multiplied by the learning rate). This increases the weights of the inputs with positive values (where xᵢⱼ > 0), while decreasing the weights of the inputs with negative values (where xᵢⱼ < 0). Consequently, the overall net input of the perceptron increases.

  2. The true label is yᵢ = 0, but the perceptron’s prediction is oᵢ = 1, i.e., wᵗxᵢ + b > 0. Analogously to the previous case, here we would like to decrease the perceptron’s net input, so that it eventually becomes negative. This is achieved by adding the quantity (yᵢ – oᵢ)xᵢ = –xᵢ to the weight vector, since this decreases the weights of the inputs with positive values (where xᵢⱼ > 0) while increasing the weights of the inputs with negative values (where xᵢⱼ < 0). Consequently, the overall net input of the perceptron decreases.

This learning rule is applied to all the training samples sequentially (in an arbitrary order). It typically requires more than one pass over the entire training set (each pass is called an epoch) to find the correct weight vector (i.e., the weight vector of a hyperplane that separates the positive examples from the negative ones).

According to the perceptron convergence theorem, if the data is linearly separable, applying the perceptron learning rule repeatedly will eventually converge to the weights of the separating hyperplane (in a finite number of steps). The interested reader can find a formal proof of this theorem in this paper.

The Perceptron Learning Algorithm

In practice, it may take a long time for the perceptron learning process to converge (i.e., reach zero errors on the training set). Furthermore, the data itself may not be linearly separable, in which case the algorithm may never terminate. Therefore, we need to limit the number of training epochs by some predefined parameter. If the perceptron achieves zero errors on the training set before this number is reached, we can stop its training earlier.

The perceptron learning algorithm is summarized in the following pseudocode:
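One way to write it, following the learning rule and the stopping criterion described in this section:

Initialize the weights w₁, …, wₘ to small random values and the bias b to 0
Repeat for a predefined number of epochs:
    errors ← 0
    For each training sample (xᵢ, yᵢ):
        oᵢ ← f(wᵗxᵢ + b)              # compute the perceptron's output
        If oᵢ ≠ yᵢ:                    # the sample is misclassified
            wⱼ ← wⱼ + α(yᵢ − oᵢ)xᵢⱼ   for every weight j
            b ← b + α(yᵢ − oᵢ)
            errors ← errors + 1
    If errors = 0, stop the training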

Note that the weights are typically initialized to small random values, while the bias is initialized to zero. (Random initialization matters mainly in multi-layer networks, where it breaks the symmetry between neurons in the same layer; a single perceptron can also be trained from all-zero weights, as the example in the next section shows.)

Example: Learning the Majority Function

For example, let’s see how the perceptron learning algorithm can be used to learn the majority function of three binary inputs. The majority function is a function that evaluates to true (1) when half or more of its inputs are true and to false (0) otherwise.

The training set of the perceptron includes all 8 possible combinations of the three binary inputs:

The training set for the majority function

In this example we will assume that the initial weights and bias are 0, and the learning rate is α = 0.5.

Let’s track the weight updates during the first epoch of training. The first sample presented to the perceptron is x = (0, 0, 0). The net input of the perceptron in this case is: z = wᵗx + b = 0 × 0 + 0 × 0 + 0 × 0 + 0 = 0. Therefore, its output is o = 1 (remember that the step function outputs 1 whenever its input is ≥ 0). However, the target label in this case is y = 0, so the error made by the perceptron is y – o = -1.

According to the perceptron learning rule, we update each weight wᵢ by adding to it α(y – o)xᵢ = -0.5xᵢ. Since all the inputs are 0 in this case, except for the bias neuron (x₀ = 1), we only update the bias to be -0.5 instead of 0.

We repeat the same process for all the other 7 training samples. The following table shows the weight updates after every sample:

The first epoch of training

During the first epoch the perceptron has made 4 errors. The weight vector after the first epoch is w = (0, 0.5, 1) and the bias is 0.

In the second training epoch, we get the following weight updates:

Second epoch of training

This time the perceptron has made only three errors. The weight vector after the second epoch is w = (0.5, 0.5, 1) and the bias is -0.5.

The weight updates in the third epoch are:

Third epoch of training

After the update on the second sample in this epoch, the perceptron has converged to a weight vector that solves this classification problem: w = (0.5, 0.5, 0.5) and b = -1. Since all the weights are equal, the perceptron fires only when at least two of the inputs are 1, in which case their weighted sum is at least 1, i.e., greater than or equal to the absolute value of the bias, hence the net input of the perceptron is non-negative.
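As a quick sanity check, the following snippet verifies that these weights implement the majority function over all eight input combinations:

import numpy as np
from itertools import product

w, b = np.array([0.5, 0.5, 0.5]), -1  # the weights found above

for x in product([0, 1], repeat=3):
    x = np.array(x)
    o = np.heaviside(w @ x + b, 1)
    assert o == (x.sum() >= 2)  # fires exactly when at least two inputs are 1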

Perceptron Implementation in Python

Let’s now implement the perceptron learning algorithm in Python.

We will implement it as a custom Scikit-Learn estimator by extending the sklearn.base.BaseEstimator class. This will allow us to use it like any other estimator in Scikit-Learn (e.g., adding it to a pipeline).

A custom estimator needs to implement the fit() and predict() methods, and set all its hyperparameters in the __init__() method.

I will first show the complete code of this class, and then walk through it step-by-step.

import numpy as np
from sklearn.base import BaseEstimator

class Perceptron(BaseEstimator):
    def __init__(self, alpha, n_epochs):
        self.alpha = alpha        # the learning rate
        self.n_epochs = n_epochs  # number of training iterations

    def fit(self, X, y):
        (n, m) = X.shape  # n is the number of samples, m is the number of features

        # Initialize the weights to small random values
        self.w = np.random.randn(m)
        self.b = 0

        # The training loop
        for epoch in range(self.n_epochs):
            n_errors = 0  # number of misclassification errors

            for i in range(n):
                o = self.predict(X[i])
                if o != y[i]:
                    # Apply the perceptron learning rule
                    self.w += self.alpha * (y[i] - o) * X[i]
                    self.b += self.alpha * (y[i] - o)
                    n_errors += 1

            # Compute the accuracy on the training set
            accuracy = 1 - (n_errors / n)
            print(f'Epoch {epoch + 1}: accuracy = {accuracy:.3f}')

            # Stop the training when there are no more errors
            if n_errors == 0:
                break

    def predict(self, X):
        z = X @ self.w + self.b
        return np.heaviside(z, 1)  # the step function

The constructor of the class initializes the two hyperparameters of the model: the learning rate (alpha) and the number of training epochs (n_epochs).

def __init__(self, alpha, n_epochs):
    self.alpha = alpha
    self.n_epochs = n_epochs

The fit() method runs the learning algorithm on a given data set X with labels y. We first find out how many samples and features we have in the data set by interrogating the shape of X:

(n, m) = X.shape

n is the number of training samples and m is the number of features.

Next, we initialize the weight vector using the standard normal distribution (with mean 0 and standard deviation of 1), and the bias to 0:

self.w = np.random.randn(m)
self.b = 0

We now run the training loop for n_epochs iterations. In each iteration, we go over all the training samples, and for each sample we check if the perceptron classifies it correctly by calling the predict() method and comparing its output to the true label:

for i in range(n):
    o = self.predict(X[i])
    if o != y[i]:

If the perceptron has misclassified the sample, we apply the weight update rule to both the weight vector and the bias, and then increment the number of misclassification errors by 1:

self.w += self.alpha * (y[i] - o) * X[i]
self.b += self.alpha * (y[i] - o)
n_errors += 1

When the epoch terminates, we report the perceptron’s current accuracy on the training set, and if the number of errors was 0, we terminate the training loop:

accuracy = 1 - (n_errors / n)
print(f'Epoch {epoch + 1}: accuracy = {accuracy:.3f}')

if n_errors == 0:
    break

The predict() method is quite straightforward. We first compute the net input of the perceptron as the dot product between the input vector and the weights plus the bias:

z = X @ self.w + self.b

Finally, we use NumPy’s heaviside() function to apply the step function to the net input and return the output:

return np.heaviside(z, 1)

The second parameter of np.heaviside() specifies what should be the value of the function for z = 0.
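For instance, passing 1 as the second argument maps a net input of exactly zero to a firing output:

np.heaviside(np.array([-2.0, 0.0, 3.0]), 1)  # array([0., 1., 1.])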


Let’s now test our implementation on a data set generated by the make_blobs() function from Scikit-Learn.

We first generate a data set with 100 random points divided into two groups:

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=0.5)

We set cluster_std to 0.5 (instead of the default 1) in order to make sure that the data is linearly separable.

Let’s plot the data set:

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, style=y, markers=('s', 'o'), 
                palette=('r', 'b'), edgecolor='black')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
The blobs data set

We now instantiate the Perceptron class and fit it to the data set:

perceptron = Perceptron(alpha=0.01, n_epochs=10)
perceptron.fit(X, y)

The output during training is:

Epoch 1: accuracy = 0.250
Epoch 2: accuracy = 0.950
Epoch 3: accuracy = 1.000

The perceptron has converged after three epochs of training.

We can plot the decision boundary found by the perceptron and the two class areas using the following function:

def plot_decision_boundary(model, X, y):
    # Retrieve the model parameters
    w1, w2, b = model.w[0], model.w[1], model.b

    # Calculate the intercept and slope of the separating line
    slope = -w1 / w2
    intercept = -b / w2

    # Plot the line
    x1 = X[:, 0]
    x2 = X[:, 1]
    x1_min, x1_max = x1.min() - 0.2, x1.max() + 0.2
    x2_min, x2_max = x2.min() - 0.5, x2.max() + 0.5
    x1_d = np.array([x1_min, x1_max])
    x2_d = slope * x1_d + intercept

    # Fill the two classification areas with two different colors
    plt.plot(x1_d, x2_d, 'k', ls='--')
    plt.fill_between(x1_d, x2_d, x2_min, color='blue', alpha=0.25)
    plt.fill_between(x1_d, x2_d, x2_max, color='red', alpha=0.25)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)

    # Draw the data points
    sns.scatterplot(x=x1, y=x2, hue=y, style=y, markers=('s', 'o'), 
                    palette=('r', 'b'), edgecolor='black')
    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')

plot_decision_boundary(perceptron, X, y)
The separating hyperplane found by the perceptron

Scikit-Learn provides its own Perceptron class that implements a similar algorithm, but provides more options such as regularization and early stopping.
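For comparison, here is a minimal usage sketch of the built-in estimator on the same blobs data (the hyperparameter values are only illustrative):

from sklearn.linear_model import Perceptron

clf = Perceptron(max_iter=10, eta0=0.01)  # eta0 is the learning rate
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the training set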

Limitations of the Perceptron Model

Although the perceptron model showed some initial success, it was quickly realized that perceptrons cannot learn even some simple functions, such as the XOR function:

The XOR problem cannot be solved by a perceptron

The XOR problem is not linearly separable, therefore linear models such as perceptrons cannot solve it.

This revelation caused the field of neural networks to stagnate for many years (a period known as "the AI winter"), until it was realized that stacking multiple perceptrons in layers can solve more complex, non-linear problems such as XOR.
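For example, XOR can be expressed by combining the gates built earlier, since XOR(x₁, x₂) = AND(OR(x₁, x₂), NAND(x₁, x₂)). Here is a minimal sketch of this idea, using one possible choice of weights for each gate:

import numpy as np
from itertools import product

def perceptron(x, w, b):
    return np.heaviside(np.dot(w, x) + b, 1)

def xor(x1, x2):
    # First layer: OR and NAND perceptrons
    h1 = perceptron(np.array([x1, x2]), np.array([1, 1]), -0.5)    # OR
    h2 = perceptron(np.array([x1, x2]), np.array([-1, -1]), 1.5)   # NAND
    # Second layer: an AND perceptron applied to the hidden outputs
    return perceptron(np.array([h1, h2]), np.array([1, 1]), -1.5)  # AND

for x1, x2 in product([0, 1], repeat=2):
    print(f'{x1} XOR {x2} = {xor(x1, x2):.0f}')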

Multi-layer perceptrons (MLPs) are covered in this article.

Final Notes

All images unless otherwise noted are by the author.

You can find the code samples of this article on my github: https://github.com/roiyeho/medium/tree/main/perceptrons

Thanks for reading!

