Easy Logistic Regression with an Analytical Solution

Binary Classification from Scratch in Python

Casey Whorton
Towards Data Science


There is no shortage of articles, videos and tutorials on logistic regression for classification. It’s a classic subject in Machine Learning and is usually a stepping stone before moving on to more complex algorithms. What this article aims to do is show you logistic regression through another lens, one where we can solve for the weights analytically and then pass them to a model that returns the predicted probability. I provide links to the code and the solution in the article.

The problem that Logistic Regression aims to tackle is finding the probability that an observation (a set of feature values) belongs to a certain class. By training on examples where we see which class observations actually belong to (this is the label, or target variable), our model gets a good idea of what a new observation’s class will be. Exactly how the model is trained is not always discussed, but I want to discuss a case where a simple rule can determine those probabilities.

Don’t you dare bring up Bayes’ Theorem

We couldn’t do it without thee.

Remember, a classification problem can be thought of as a conditional probability: The probability of a certain class label, given a set of features.

As a starting block for this article, you just need to know that the conditional probability of a random variable (in this case Y) taking a certain value given X takes this form:
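Sketched out in LaTeX (Y is the class label, X the feature vector):

$$
P(Y=1 \mid X) = \frac{P(X \mid Y=1)\,P(Y=1)}{P(X \mid Y=1)\,P(Y=1) + P(X \mid Y=0)\,P(Y=0)}
$$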

The denominator is a result of the “Law of Total Probability”, which is a useful tool in probability theory

Sigmoid Function

You should already know that the function that returns the predicted probability of our vector of x values falling into one class is called the sigmoid function. The convenience of the function is that its output is bounded between 0 and 1 for all real inputs, so no matter what you put in (errors aside) you get a valid predicted probability.
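As a quick reference, here is the sigmoid of a score z, where z will end up being a linear combination of the features:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$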

What’s interesting here is that, after dividing the numerator and denominator of the Bayes’ theorem equation above by P(X|Y=1)*P(Y=1), we get a form that resembles the sigmoid function, and we can solve for what should be in the exponent below:
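A sketch of that rearrangement in LaTeX:

$$
P(Y=1 \mid X)
= \frac{1}{1 + \dfrac{P(X \mid Y=0)\,P(Y=0)}{P(X \mid Y=1)\,P(Y=1)}}
= \frac{1}{1 + e^{-z}},
\qquad
z = \ln\!\left(\frac{P(X \mid Y=1)\,P(Y=1)}{P(X \mid Y=0)\,P(Y=0)}\right)
$$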

Even more interesting is how the weights and bias turn out to be a combination of the probabilities of Y being 0 and 1. So it’s almost like, if we knew how to represent these probabilities with our dataset, we could find a solution for what the weights should be!

Equal Covariances

Like many base-case solutions to machine learning problems, this one makes assumptions about the data. For example, independence of observations and approximately normally distributed errors are assumptions used in linear regression.

In my experience, real-world data rarely meets all the underlying assumptions needed to use the textbook or out-of-the-box solution. So much so that I often need to go back and look up the basic assumptions as a reminder.

With that being said, an underlying assumption behind the solution in this article is that both of the bivariate normal distributions have the same covariance matrix: their spread is the same amount, and in the same directions. If the covariances were unequal, the quadratic terms in the exponents would not cancel, the boundary between the classes would no longer be linear, and the simple solution proposed here would not apply.

Unexplained Leaps in Logic

There are some algebraic and matrix manipulations to get to this point, but I am including a link here to the entire derivation for anybody who wants to see it.

It looks pretty messy right now, but it gets better. Focus your attention on the first term inside the natural logarithm; this is the probability density function of a bivariate normal distribution. What I find interesting here is that the class a given pair (x1, x2) belongs to comes down, intuitively, to the difference between the two probability density functions evaluated at that (x1, x2). This totally makes sense, right?
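Roughly, the messy expression has this shape (a sketch using f1 and f0 for the two class-conditional bivariate normal densities, mu_1 and mu_0 for their means, Sigma for the shared covariance matrix, and pi_1 and pi_0 for the class probabilities; the linked derivation has the exact form):

$$
z = \ln\!\left(\frac{f_1(x)\,\pi_1}{f_0(x)\,\pi_0}\right),
\qquad
f_k(x) = \frac{1}{2\pi\,\lvert\Sigma\rvert^{1/2}}
\exp\!\left(-\tfrac{1}{2}(x-\mu_k)^{\top}\Sigma^{-1}(x-\mu_k)\right)
$$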

We’re still looking for something from these bivariate normal distributions that can tell us, for a given pair (x1, x2), what the weights should be that separate the two classes. So, we need something that multiplies with our x vector. See the link for all the steps to get there, but in the end there is an analytical solution that says: from the data, and from the distribution of the two classes, this bias and these weights are the best at separating the classes.
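Under the equal-covariance assumption, carrying the algebra through gives a closed form along these lines (same symbols as the sketch above; see the linked derivation for the full steps):

$$
w = \Sigma^{-1}(\mu_1 - \mu_0),
\qquad
b = -\tfrac{1}{2}\,\mu_1^{\top}\Sigma^{-1}\mu_1 + \tfrac{1}{2}\,\mu_0^{\top}\Sigma^{-1}\mu_0 + \ln\!\left(\frac{\pi_1}{\pi_0}\right),
\qquad
z = w^{\top}x + b
$$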

What is so great about this is that you can see an example of how, for simple classification problems, empirical data can be used to solve a problem without a complex algorithm. Even better, with Python we can visualize a solution to a binary classification problem to illustrate the concept.

Illustrating the Point with Python

If we visualize our data well, we can see that there is a sort of boundary between these two classes. It isn’t super well defined, but the boundary can be described in terms of x1 and x2, conveniently the features we are using. This means that, for any pair (x1, x2) in R², we can find which class that combination should belong to.

Both bivariate normal, so imagine a ‘peak’ in the center of each cluster coming towards you (on the z-axis)
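A minimal sketch of how a dataset like this could be simulated and plotted; the means, shared covariance and sample sizes below are made-up values for illustration, not the exact ones used in the article’s code:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Two bivariate normal clusters with the SAME covariance matrix
# (the equal-covariance assumption the analytical solution relies on)
mu_neg, mu_pos = np.array([0.0, 0.0]), np.array([3.0, 3.0])
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])

X_neg = rng.multivariate_normal(mu_neg, cov, size=200)  # 'Negative' cluster
X_pos = rng.multivariate_normal(mu_pos, cov, size=200)  # 'Positive' cluster

plt.scatter(X_neg[:, 0], X_neg[:, 1], alpha=0.6, label="Negative (y = 0)")
plt.scatter(X_pos[:, 0], X_pos[:, 1], alpha=0.6, label="Positive (y = 1)")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()
```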

Normally, we would use something like scikit-learn to instantiate a LogisticRegression object, fit it to the data, and use the fitted model to make predictions on new observations. I mean, real data usually has unequal covariances, so why not benefit from an algorithm that handles that? Here, we are defining a much simpler problem for illustrative purposes.
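For comparison, the usual scikit-learn route would look something like this (continuing with made-up clusters like the ones above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])

# Stack the two clusters into one feature matrix with 0/1 labels
X = np.vstack([rng.multivariate_normal([0, 0], cov, 200),
               rng.multivariate_normal([3, 3], cov, 200)])
y = np.r_[np.zeros(200), np.ones(200)]

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)        # fitted bias and weights
print(clf.predict_proba(X[:5])[:, 1])   # predicted probability of the positive class
```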

Below you can see that we can find the weights just using the analytical solution and what we know about each distribution of clusters.
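Here is a sketch of what that calculation might look like in Python, using the empirical class means, a pooled covariance matrix, and the class proportions; the variable names and data are my own for illustration, not necessarily those in the article’s code:

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X_neg = rng.multivariate_normal([0, 0], cov, 200)  # class 0
X_pos = rng.multivariate_normal([3, 3], cov, 200)  # class 1

# Empirical pieces of the analytical solution
mu0, mu1 = X_neg.mean(axis=0), X_pos.mean(axis=0)
sigma = (np.cov(X_neg, rowvar=False) + np.cov(X_pos, rowvar=False)) / 2  # pooled (shared) covariance
pi0 = len(X_neg) / (len(X_neg) + len(X_pos))                             # class proportions
pi1 = 1 - pi0

sigma_inv = np.linalg.inv(sigma)
w = sigma_inv @ (mu1 - mu0)                     # weights
b = (-0.5 * mu1 @ sigma_inv @ mu1
     + 0.5 * mu0 @ sigma_inv @ mu0
     + np.log(pi1 / pi0))                       # bias

print("weights:", w, "bias:", b)
```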

Now that we have our weights, we’ll make a design matrix like the one we would normally use in a fitting method, make predictions on each observation by passing the dot product of the features and the weights to the sigmoid function, and return everything in a table (a pandas DataFrame).
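A sketch of that step; the weights and bias here are placeholder values standing in for the analytical solution above:

```python
import numpy as np
import pandas as pd

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weights/bias standing in for the analytical solution
w = np.array([1.6, 1.4])
b = -4.5

# A few toy (x1, x2) observations; in the article this would be the full dataset
X = np.array([[0.2, 0.1],
              [1.5, 1.4],
              [3.1, 2.9]])

# Design matrix with a leading column of ones so the bias gets its own weight
design = np.column_stack([np.ones(len(X)), X])
weights = np.r_[b, w]

probs = sigmoid(design @ weights)

results = pd.DataFrame({"x1": X[:, 0], "x2": X[:, 1], "p_positive": probs})
print(results)
```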

After returning the predicted probabilities for each (x1, x2) combination in our design matrix, plotting them shows us that the combinations closer to the boundary between the two classes have probabilities closer to 0.5. A predicted probability close to 0 means the (x1, x2) is almost certainly in the “Negative” cluster, and similarly for probabilities close to 1 and the “Positive” cluster.
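One way to visualize this, and the discriminating line where the probability is exactly 0.5, is a sketch like the following (again with placeholder weights):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weights/bias standing in for the analytical solution
w = np.array([1.6, 1.4])
b = -4.5

# Predicted probability over a grid of (x1, x2) values
x1, x2 = np.meshgrid(np.linspace(-3, 6, 200), np.linspace(-3, 6, 200))
p = sigmoid(w[0] * x1 + w[1] * x2 + b)

plt.contourf(x1, x2, p, levels=20, cmap="coolwarm")
plt.colorbar(label="P(Y = 1 | x1, x2)")

# The discriminating line is where w.x + b = 0, i.e. where the probability is 0.5
xs = np.array([-3.0, 6.0])
plt.plot(xs, -(b + w[0] * xs) / w[1], "k--", label="decision boundary")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()
```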

Well, I hope you had some fun reading this. If you’re interested in seeing the full code, it can be found here. There, you can see the formula for plotting the discriminating line, as well as a link to further details on the derivation.
