From the Perceptron to Adaline

Setting the foundations right

Pan Cretan
Towards Data Science

--

Photo by Einar Storsul on Unsplash

Introduction

In a previous article I tried to explain the most basic binary classifier that has likely ever existed, Rosenblatt’s perceptron. Understanding this algorithm has educational value and can serve as a good introduction in elementary machine learning courses. It is an algorithm that can be coded from scratch in a single afternoon and can spark interest, a sense of achievement and motivation to delve into more complex topics. Still, as an algorithm it leaves much to be desired because convergence is only guaranteed when the classes are linearly separable that is often not the case.

In this article we will continue the journey on mastering classification concepts. A natural evolution from the Rosenblatt’s perceptron is the adaptive linear neuron classifier, or adaline as it is colloquially known. Moving from the perceptron to adaline is not a big leap. We simply need to change the step activation function to a linear one. This small change leads to a continuous loss function that can be robustly minimised. This allows us to introduce many useful concepts in machine learning, such as vectorisation and optimisation methods.

In future articles we will also cover further subtle changes of the activation and loss functions that will take us from adaline to logistic regression, that is already a useful algorithm in daily practice. All of the above algorithms are essentially single layer neural networks and can be readily extended to multilayer ones. In this sense, this article takes the reader a step further through this evolution and builds the foundations to tackle more advanced concepts.

We will need some formulas. I used the online LaTeX equation editor to develop the LaTeX code for the equation and then the chrome plugin Maths Equations Anywhere to render the equation into an image. The only downside of this approach is that the LaTeX code is not stored in case you need to render it again. For this purpose I provide the list of equations at the end of this article. If you are not familiar with LaTex this may have its own educational value. Getting the notation right is part of the journey in machine learning.

Adaptive linear neuron classifier (adaline)

So what is the adaline algorithm? Adaline is a binary classifier as the perceptron. A prediction is made by using a set of input values for the features [x₁, .. , xₘ] where m is the number of features. The input values are multiplied with the weights [w₁, .. , wₘ] and the bias is added to obtain the net input z = w₁x₁ + .. + wₘxₘ + b. The net input is passed to the linear activation function σ(z) that is then used to make a prediction using a step function as with the perceptron:

A key difference with the perceptron is that the linear activation function is used for learning the weights, whilst the step function is only used for making the prediction at the end. This sounds like a small thing, but it is of significant importance. The linear activation function is differentiable whilst the step function is not! The threshold 0.5 above is not written in stone. By adjusting the threshold we can regulate the precision and recall according to our use case, i.e. based on what is the cost of false positives and false negatives.

In the case of adaline the linear activation function is simply the identity, i.e. σ(z) = z. The objective function (also known as loss function) that needs to be minimised in the training process is

where w are the weights

and b is the bias. The summation is over all of the examples in the training set. In some implementations the loss function also includes a 1/2 coefficient for convenience. This cancels out once we take the gradients of the loss function with respect to the weights and bias and, as we will see below, has no effect other than scaling the learning rate by a factor of 2. In this article we do not use the 1/2 coefficient.

For each example, we compute the square difference between the calculated outcome

and the true class label. Note that the input vector is understood to be a matrix with shape (1, m), i.e. as we will see later is one row of our feature matrix x with shape (n, m).

The training is nothing else than an optimisation problem. We need to adjust the weights and bias so that the loss function is minimised. As with any minimisation problem we need to compute the gradients of the objective function with respect to the independent variables that in our case will be the weights and the bias. The partial derivative of the loss function with regard to the weight wⱼ is

The last row introduces important matrix notation. The feature matrix x has shape (n, m) and we take the transpose of its column j, i.e. a matrix with shape (1, n). The true class labels y is a matrix with shape (n, 1). The net output of all samples z is also a matrix with shape (n, 1), that does not change after the activation that is understood to apply to each of its elements. The final result of the above formula is a scalar. Can you guess how we could express the gradients with respect to all weights using the matrix notation?

where the transpose of the feature matrix has shape (m, n). The end result of this operation is a matrix with shape (m, 1). This notation is important. Instead of using loops, we will be using exactly this matrix multiplication using numpy. In the era of neural networks and GPUs, the ability to apply vectorization is essential!

What about the gradient of the loss function with respect to the bias?

where the overbar denotes the mean of the vector under it. Once more, computing the mean with numpy is a vectorised operation, i.e. summation does not need to be implemented using a loop.

Once we have the gradients we can employ the gradient descent optimisation method to minimise the loss. The weights and bias terms are iteratively updated using

where η is a suitable chosen learning rate. Too small values can delay convergence, whilst too high values can prevent convergence altogether. Some experimentation is needed, as is generally the case with the parameters of machine learning algorithms.

In the above implementation we assume that the weights and bias are updated based on all examples at once. This is known as full batch gradient descent and is one extreme. The other extreme is to update the weights and bias after each training example, that is known as stochastic gradient descent (SGD). In reality there is also some middle ground, known as mini batch gradient descent, where the weights and bias are updated based on a subset of the examples. Convergence is typically reached faster in this way, i.e. we do not need to run as many iterations over the whole training set, whilst vectorisation is still (at least partially) possible. If the training set is very large (or the model is very complex as is nowadays the case with the transformers in NLP) full batch gradient descent may simply be not an option.

Alternative formulation and closed form solution

Before we proceed with the implementation of adaline in Python, we will make a quick digression. We could absorb the bias b in the weight vector as

in which case the net output for all samples in the training set becomes

meaning that the feature matrix has been prepended with a column filled with 1, leading to a shape (n, m+1). The gradient with regard to the combined weights set becomes

In principle we could derive a closed form solution given that at the minimum all gradients will be zero

In reality the inverse of the matrix in the above equation may not exist because of singularities or it cannot be computed sufficiently accurately. Hence, such closed form solution is not used in practice neither in machine learning nor in numerical methods in general. Still, it is useful to appreciate that adaline resembles linear regression and as such it has a closed form solution.

Implementing adaline in Python

Our implementation will use mini batch gradient descent. However, the implementation is flexible and allows optimising the loss function using both stochastic gradient descent and full batch gradient descent as the two extremes. We will examine the convergence behaviour by varying the batch size.

We implement adaline using a class that exposes a fit and a predict function in the usual scikit-learn API style.

Upon initialisation the adaline classifier sets the batch size for the mini batch gradient descent. If batch size is set to None, the whole training set is used (full batch gradient descent), otherwise the training set is used in batches (mini batch gradient descent). If the batch size is one we essentially revert to stochastic gradient descent. The training set is shuffled before each pass through the training set to avoid repetitive cycles, but this only has an effect if mini batch gradient descent is used. The essence of the algorithm is in the _update_weights_bias function that carries out a full pass through the training set and returns the corresponding loss. This function applies the gradient descent with the analytically computed gradients using the derivations as in the previous section. Note the use of the numpy matmul and dot functions that avoid the use of explicit loops. If the batch_size is set to None then there are no loops whatsoever and the implementation is fully vectorised.

Using adaline in practice

We make the necessary imports and create a synthetic dataset as in the earlier perceptron article

that produces

Scatterplot of the two classes in the synthetic dataset. Image by the Author.

The only difference with the earlier article is that we tweaked the gaussian means and covariances so that the classes are not linearly separable as we would expect adaline to overcome this. Moreover, the two independent variables have on purpose different scales to discuss the importance of feature scaling.

Let’s try to fit a first model and visualise convergence. Prior to fitting we normalise the features so that they both have zero mean and unit standard deviation

This produces the convergence plot

Adaline convergence. Image by the Author.

Adaline slowly converges, but the loss function does not become zero. In order to verify the successful training we visualise the decision boundary using the same approach as in the earlier article

that produces

Decision boundary of the fitted adaline model. Image by the Author.

There are some misclassified points given that the two classes in the training set were not linearly separable and we used a linear decision boundary. Still, the algorithm converged nicely. The solution is deterministic. With sufficient number of passes through the training set we obtain numerically equal weights and bias, regardless of their initial values.

Mini batch vs. full batch gradient descent

The above numerical experiment used full batch gradient descent that partially explains the slow convergence. We will use the same dataset and random state as before, but this time we will fit the adaline classifier using different batch sizes, ranging from 20 to 400 that is the number of examples in our training set.

that produces

Effect of batch size on convergence (using a 0.001 learning rate). Image by the Author.

We can clearly see that the smaller the batch size the faster the convergence, but there are also some oscillations. These oscillation may destabilise the convergence with larger learning rates. If we double the learning rate to 0.002 this becomes evident

Effect of batch size on convergence (using a 0.002 learning rate). Image by the Author.

Increasing the learning rate further will eventually prevent convergence with the smaller batch sizes. With even larger learning rates even the full batch gradient descent will fail to converge as we would overshoot the global minimum.

Conclusions

Adaline is a significant improvement over the perceptron. The weights and bias are obtained via the minimisation of a continuous loss function that in addition is convex (and hence does not have local minima). With a sufficient small learning rate the algorithm converges even if the classes are not linearly separable. When using gradient descent in any of its variants the convergence rate is affected by the scaling of the features. In this article we used simple standardisation that shifts the mean of every feature to become zero, whilst the spread is adjusted to unit variance. In this way it is possible to select a learning rate that works well for all weights and bias, meaning that the global minimum can be obtained in fewer epochs.

Obtaining a good understanding on how to build a binary classifier using vectorisation is key before delving into more complex topics, such as support vector machines and multilayer neural networks. In daily practice, one would use scikit-learn that offers advanced classification algorithms that allow for nonlinear decision boundaries, whilst supporting efficient and systematic hyper parameter tuning and cross validation. However, building simple binary classifiers from scratch offers a deep understanding, increases confidence and gives a sense of ownership. Although building everything from scratch is of course not realistic, deeply understanding the simpler algorithms provides the necessary skills and insights so that more advanced algorithms included in off-the-shelf libraries feel less opaque.

LaTeX code of equations used in the article

The equations used in the article can be found in the gist below, in case you would like to render them again.

--

--

I am an engineer who turned into a data analyst. I enjoy learning and sharing knowledge with experts in data analysis and modelling.