
The Essence of Logistic Regression

An explanation of the origins of Logistic Regression using Generalised Linear Models

What is Logistic Regression?

Photo by Brett Jordan on Unsplash

Logistic Regression is a ubiquitous algorithm used by nearly every Data Scientist. However, despite being so well known and widely implemented, its origins are still not fully understood by many practitioners. In my previous article, I discussed Generalized Linear Models (GLMs) and their link to Machine Learning algorithms. I would advise the reader to look over that article to gain a full intuition of GLMs. Briefly put, however, GLMs provide a theoretical framework for modelling target variables that are not normally distributed. In this article, we will derive Logistic Regression using GLMs to show exactly where it comes from.

Example Problem and Motivation

The aim of Logistic Regression is to assign a probability to an event occurring, or to a sample belonging to a certain class, given some features. The underlying outcome is binary, which is analogous to a boolean-valued output.

An example problem is determining whether a student passes an exam or not. Let’s assign a pass (success) as 1 and a fail as 0. Now, let’s assume we know how long they have spent studying for their exam, call this X_1, and whether they passed their previous exam, X_2. Therefore, we can formulate this problem as:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
Equation generated by author in LaTeX.

Where Y is the target, which should take values between 0 and 1, and the β values are the unknown coefficients that we need to estimate to fit the model. However, do you see the problem with the above equation? There is no guarantee that the output will be between 0 and 1. The time spent studying, X_1, can take any value from 0 upwards, so we can end up with a Y value greater than 1. This makes our model nonsensical.
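To make the problem concrete, here is a minimal sketch in Python. The coefficient values are made up purely for illustration (they are not fitted to any data), and `linear_model` is a hypothetical helper name.

```python
def linear_model(hours_studied, passed_previous, beta=(-0.5, 0.1, 0.4)):
    """Naive linear model Y = b0 + b1 * X_1 + b2 * X_2.

    The beta values are illustrative assumptions, not fitted coefficients.
    """
    b0, b1, b2 = beta
    return b0 + b1 * hours_studied + b2 * passed_previous


# A student who studied for 20 hours and passed the previous exam:
# the "probability" comes out greater than 1.
print(linear_model(20, 1))

# A student who did not study and failed the previous exam:
# with these coefficients the "probability" is negative.
print(linear_model(0, 0))
```

Neither output is a valid probability, which is exactly the issue a link function is meant to solve.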

Therefore, we need to find a way, or better a function, to meet the requirements of our target variable. This function can be found using the mathematical framework of GLMs.

Bernoulli and Binomial Distributions

The required output for the above problem is described by the Bernoulli distribution. This distribution gives the probability of each outcome of a single trial with two possible outcomes, success or failure; for example, whether a coin flip lands on heads. One typically assigns success a probability of p, and hence failure a probability of 1–p.

The probability mass function for the Bernoulli distribution is:

$$f(x; p) = p^{x} (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$
Equation generated by author in LaTeX.

Where x is the number of successful trials (either 0 or 1, since there is only a single trial) and p is the probability of a successful trial.

The Bernoulli distribution is a special case of the Binomial distribution with a single trial (n = 1). In the Binomial distribution we have n trials and can therefore have more than 1 successful trial. The probability mass function for the Binomial distribution is:

$$f(x; n, p) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \quad x \in \{0, 1, \dots, n\}$$
Equation generated by author in LaTeX.

This has the same form as the function for the Bernoulli distribution, except that the exponent of 1–p is now n–x and we multiply by the Binomial coefficient, given by:

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$
Equation generated by author in LaTeX.

These coefficients count the number of ways (combinations) of obtaining x successes in n trials. They appear throughout combinatorics, for example as the entries of Pascal’s triangle.
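As a quick sanity check, both probability mass functions can be written in a few lines of Python. This is only a sketch using the standard library; the function names and the value p = 0.3 are my own choices for illustration.

```python
from math import comb


def bernoulli_pmf(x, p):
    """P(X = x) for a single trial, x in {0, 1}: p^x * (1 - p)^(1 - x)."""
    return p**x * (1 - p) ** (1 - x)


def binomial_pmf(x, n, p):
    """P(X = x) for n trials: C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


p = 0.3  # illustrative success probability

# With n = 1 the Binomial reduces to the Bernoulli.
for x in (0, 1):
    assert binomial_pmf(x, 1, p) == bernoulli_pmf(x, p)

# The probabilities over all possible outcomes sum to 1
# (up to floating-point rounding).
total = sum(binomial_pmf(x, 10, p) for x in range(11))
print(total)
```

Here `math.comb(n, x)` computes the Binomial coefficient n! / (x! (n - x)!) directly.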


GLMs and Link Function

GLMs can be used to determine the function that ‘links’ the inputs to the required distribution outputs. The GLM theoretical framework requires that the target variable distribution is a member of the exponential family, which is given by the following probability density function:

$$f(y; \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)$$
Equation generated by author in LaTeX.

Where θ is referred to as the natural parameter, which is linked to the mean, and φ is the scale parameter, which is linked to the variance. Furthermore, a(φ), b(θ) and c(y,φ) are additional functions that need to be determined for each distribution.

It can be mathematically derived that the mean, E(Y), and variance, Var(Y), for the exponential family are governed by:

$$E(Y) = b'(\theta)$$
Equation generated by author in LaTeX.
$$\mathrm{Var}(Y) = b''(\theta)\, a(\phi)$$
Equation generated by author in LaTeX.

These formulae are just shown for completeness and are not necessary for this derivation. Again, this theoretical framework is explained in more detail in my previous article.

Link Function For Binomial Distribution

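Indeed, the Binomial distribution is a member of the exponential family and, writing y for the number of successes, can be written in the required format as:

$$f(y; n, p) = \exp\left( y \log\left(\frac{p}{1 - p}\right) + n \log(1 - p) + \log\binom{n}{y} \right)$$
Equation generated by author in LaTeX.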

Through matching coefficients of the Binomial formula with the exponential family formula, we conclude that:

$$\theta = \log\left(\frac{p}{1 - p}\right)$$
Equation generated by author in LaTeX.

Do you recognise this equation? The above function is known as the Logit function and is the canonical link function for the Binomial/Bernoulli distribution. For reference, the p value above is the probability that the output variable Y is equal to 1, p = P(Y=1).

Therefore, for a target variable with a Binomial/Bernoulli distribution, the mathematically derived link function is the Logit function, whose inverse is the logistic function. This is why it is called Logistic Regression!
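For readers who would like to see the intermediate steps, the matching can be sketched as follows. This is one possible route rather than the only one: it writes y for the number of successes, treats n as known, and assumes the exponential family is written as f(y; θ, φ) = exp((yθ - b(θ))/a(φ) + c(y, φ)), with E(Y) = b'(θ) and Var(Y) = b''(θ)a(φ).

```latex
\begin{align*}
f(y; n, p) &= \binom{n}{y} p^{y} (1 - p)^{n - y} \\
&= \exp\left( y \log p + (n - y) \log(1 - p) + \log\binom{n}{y} \right) \\
&= \exp\left( y \log\left(\frac{p}{1 - p}\right) + n \log(1 - p) + \log\binom{n}{y} \right)
\end{align*}
% Matching term by term against exp((y\theta - b(\theta)) / a(\phi) + c(y, \phi)):
\begin{align*}
\theta &= \log\left(\frac{p}{1 - p}\right), & a(\phi) &= 1, \\
b(\theta) &= -n \log(1 - p) = n \log\left(1 + e^{\theta}\right), & c(y, \phi) &= \log\binom{n}{y}
\end{align*}
% Sanity check using the mean and variance formulae of the exponential family:
\begin{align*}
E(Y) &= b'(\theta) = \frac{n e^{\theta}}{1 + e^{\theta}} = np, \\
\mathrm{Var}(Y) &= b''(\theta)\, a(\phi) = \frac{n e^{\theta}}{(1 + e^{\theta})^{2}} = np(1 - p)
\end{align*}
```

The last two lines recover the familiar Binomial mean and variance, which is a useful check that the matching is right.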

Let us refer back to the problem we set above on whether a student passes their exam or not. We can now adapt our previous equation using the Logit function:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
Equation generated by author in LaTeX.

Rearranging:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}}$$
Equation generated by author in LaTeX.

We have derived the famous Sigmoid function! This new equation now ensures that no matter what values our inputs take, the output will always be between 0 and 1!
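The behaviour of the Sigmoid function is easy to check numerically. Below is a minimal sketch for the student example; the coefficients are again invented for illustration rather than fitted, and `pass_probability` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    """Logistic (Sigmoid) function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe for large negative z
    return ez / (1.0 + ez)


def logit(p):
    """Log-odds log(p / (1 - p)) for p in (0, 1); the inverse of sigmoid."""
    return math.log(p / (1.0 - p))


def pass_probability(hours_studied, passed_previous, beta=(-3.0, 0.2, 1.0)):
    """P(Y = 1) for the student example; beta values are illustrative assumptions."""
    b0, b1, b2 = beta
    return sigmoid(b0 + b1 * hours_studied + b2 * passed_previous)


# However many hours are studied, the output stays between 0 and 1.
for hours in (0, 5, 20, 100):
    print(hours, pass_probability(hours, 1))

# Applying the Logit to the Sigmoid output recovers the linear predictor.
print(logit(sigmoid(0.7)))
```

Unlike the naive linear model, increasing the hours studied pushes the output towards 1 but never beyond it.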

Conclusion

I hope you enjoyed the above article and gained some insight into the origins of Logistic Regression. I have omitted quite a bit of mathematical detail as some of the derivations are quite lengthy! Therefore, feel free to explore this topic further to gain a better intuition!

