What is Logistic Regression?

Logistic Regression is a ubiquitous algorithm used by nearly every Data Scientist. However, despite being so well known and widely implemented, its origins are still not fully understood by many practitioners. In my previous article, I discussed Generalized Linear Models (GLMs) and their link to Machine Learning algorithms. I would advise the reader to look over that article to gain a full intuition of GLMs. Briefly put, however, GLMs provide a theoretical framework for modelling target variables that are non-normally distributed. In this article, we will derive Logistic Regression using GLMs to show exactly where it comes from.
Example Problem and Motivation
Logistic Regression aims to assign a probability to an event occurring, or to a sample belonging to a certain class, given some features. This is analogous to a boolean-valued output.
An example problem is determining whether a student passes an exam or not. Let’s assign a pass (success) as 1 and a fail as 0. Now, let’s assume we know how long they have spent studying for their exam, call this X_1, and whether they passed their previous exam, X_2. Therefore, we can formulate this problem as:
Y = β_0 + β_1 X_1 + β_2 X_2
Where Y is the target, which should take values between 0 and 1, and the β values are the unknown coefficients that we need to compute to fit the model. However, do you see the problem with the above equation? There is no guarantee that the output will be between 0 and 1. The time spent studying, X_1, can take values from 0 to infinity, so we can end up with a Y value greater than 1. This makes our model nonsensical.
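To make the problem concrete, here is a minimal numeric sketch of the issue, using made-up coefficients of my own (the article does not fit any actual model here): a plain linear combination of the features is unbounded, so it cannot be a probability.

```python
# Hypothetical coefficients purely for illustration -- not fitted values.
def linear_model(x1, x2, b0=0.1, b1=0.05, b2=0.5):
    """Naive linear model: Y = b0 + b1*X1 + b2*X2."""
    return b0 + b1 * x1 + b2 * x2

# X1 = hours spent studying, X2 = passed previous exam (0 or 1)
print(linear_model(2, 0))   # a plausible-looking "probability"
print(linear_model(40, 1))  # greater than 1 -- not a valid probability!
```

With enough study hours, the output sails past 1, which is exactly the problem the link function below solves.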
Therefore, we need to find a way, or better a function, to meet the requirements of our target variable. This function can be found using the mathematical framework of GLMs.
Bernoulli and Binomial Distributions
The required output for the above problem is satisfied by the Bernoulli distribution. This distribution describes a single trial with two possible outcomes, success or failure; for example, whether a coin flip will land on heads. One typically assigns success a probability of p, hence failure has a probability of 1 − p.
The probability mass function for the Bernoulli distribution is:
P(X = x) = p^x (1 − p)^(1 − x),  x ∈ {0, 1}
Where x is the outcome of the trial (1 for success, 0 for failure) and p is the probability of a successful trial.
The Bernoulli distribution is a special case of the Binomial distribution, where we have multiple trials, denoted by n, and therefore can have more than one successful trial. The probability mass function for the Binomial distribution is:
P(X = x) = C(n, x) p^x (1 − p)^(n − x)
This is the same function as that for the Bernoulli distribution, except we are now multiplying by the Binomial coefficient given by:
C(n, x) = n! / (x! (n − x)!)
These coefficients count the number of ways (combinations) of obtaining x successes in n trials. They appear throughout Combinatorics and in structures such as Pascal’s triangle.
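The two distributions above can be sketched in a few lines of Python using only the standard library (`math.comb` computes the binomial coefficient), which also makes the "Bernoulli is Binomial with n = 1" relationship explicit:

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for Binomial(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def bernoulli_pmf(x, p):
    """The Bernoulli distribution is the n = 1 special case of the Binomial."""
    return binomial_pmf(x, 1, p)

# Fair coin: probability of exactly 2 heads in 4 flips
print(binomial_pmf(2, 4, 0.5))  # 0.375
print(bernoulli_pmf(1, 0.7))    # 0.7
# A pmf must sum to 1 over all possible outcomes
print(sum(binomial_pmf(x, 4, 0.5) for x in range(5)))  # 1.0
```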
GLMs and Link Function
GLMs can be used to determine the function that ‘links’ the inputs to the required distribution outputs. The GLM theoretical framework requires that the target variable distribution is a member of the exponential family, which is given by the following probability density function:
f(y; θ, φ) = exp( (yθ − b(θ)) / a(φ) + c(y, φ) )
Where θ is referred to as the natural parameter, which is linked to the mean, and φ is the scale parameter, which is linked to the variance. Furthermore, a(φ), b(θ) and c(y, φ) are additional functions that need to be determined.
It can be mathematically derived that the mean, E(Y), and variance, Var(Y), for the exponential family are governed by:
E(Y) = b′(θ)
Var(Y) = b″(θ) a(φ)
These formulae are just shown for completeness and are not necessary for this derivation. Again, this theoretical framework is explained in more detail in my previous article.
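Although not needed for the derivation, these two formulae can be sanity-checked numerically. The sketch below is my own illustration (not from the article): it uses the standard exponential-family form of the Binomial, where b(θ) = n ln(1 + e^θ) and a(φ) = 1, and verifies by finite differences that b′(θ) = np and b″(θ) a(φ) = np(1 − p), the well-known Binomial mean and variance.

```python
import math

n, p = 10, 0.3
theta = math.log(p / (1 - p))            # natural parameter (the logit of p)
b = lambda t: n * math.log1p(math.exp(t))  # b(theta) for the Binomial

h = 1e-4
# Central finite differences approximating b'(theta) and b''(theta)
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)
b_double = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2

print(b_prime)   # ~ n*p       = 3.0  (the mean)
print(b_double)  # ~ n*p*(1-p) = 2.1  (the variance, since a(phi) = 1)
```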
Link Function For Binomial Distribution
Indeed, the Binomial distribution is a member of the exponential family and can be written in the required format as:
f(y) = exp( y ln(p / (1 − p)) + n ln(1 − p) + ln C(n, y) )
Through matching coefficients of the Binomial formula with the exponential family formula, we conclude that:
θ = ln( p / (1 − p) )
Do you recognise this equation? The above function is known as the Logit function and is the canonical link function for the Binomial/Bernoulli distribution. For reference, the p value above is the probability that the output variable Y is equal to 1, p = P(Y = 1).
Therefore, for a target variable with a Binomial/Bernoulli distribution, the mathematically derived link function is the Logit function. This is why it is called Logistic Regression!
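A quick sketch of the Logit function itself: it takes a probability in (0, 1), converts it to odds p/(1 − p), and maps the result onto the whole real line, which is exactly the range a linear model can produce.

```python
import math

def logit(p):
    """Canonical link for Bernoulli/Binomial: maps p in (0, 1) to the real line."""
    return math.log(p / (1 - p))

print(logit(0.5))  # 0.0 -- even odds
print(logit(0.9))  # positive: odds favour Y = 1
print(logit(0.1))  # negative: odds favour Y = 0
# Antisymmetric about p = 0.5: logit(p) = -logit(1 - p)
```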
Referring back to the problem we set above on whether a student passes their exam or not, we can now adapt our previous equation using the Logit function:
ln( p / (1 − p) ) = β_0 + β_1 X_1 + β_2 X_2
Rearranging:
p = 1 / (1 + exp(−(β_0 + β_1 X_1 + β_2 X_2)))
We have derived the famous Sigmoid function! This new equation now ensures that no matter what values our inputs take, the output will always be between 0 and 1!
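Here is the Sigmoid in code, applied to the exam example. The coefficients are hypothetical values I chose for illustration, not fitted ones; the point is only that whatever the inputs, the output stays strictly between 0 and 1.

```python
import math

def sigmoid(z):
    """Inverse of the logit: squashes any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients for the exam example (illustrative only)
b0, b1, b2 = -4.0, 0.25, 1.5

def pass_probability(hours, passed_previous):
    return sigmoid(b0 + b1 * hours + b2 * passed_previous)

print(pass_probability(2, 0))   # low probability of passing
print(pass_probability(20, 1))  # high probability, but still below 1
print(sigmoid(math.log(0.3 / 0.7)))  # ~ 0.3 -- sigmoid inverts the logit
```

Note the round trip in the last line: feeding the logit of a probability through the Sigmoid returns the original probability, confirming the two functions are inverses.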
Conclusion
I hope you enjoyed the above article and gained some insight into the origins of Logistic Regression. I have omitted quite a bit of mathematical detail, as some of the derivations are quite lengthy! Therefore, feel free to explore this topic further to gain a better intuition!
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.