Understanding Logistic Regression — the Odds Ratio, Sigmoid, MLE, et al

Shailey Dash
Towards Data Science
9 min read · Oct 21, 2022


Logistic regression is one of the most frequently used machine learning techniques for classification. However, though seemingly simple, understanding the actual mechanics of what is happening — odds ratio, log transformation, the sigmoid — and why these are used can be quite tricky. In this blog I will explain Logistic Regression under the hood, mostly intuitively, but at times with a teeny amount of maths.


This post will be focusing on the how and why of Logistic Regression. For its use in classification and classification evaluation, check out my post here.

One of the points of confusion I have always had is how we go from something called the odds ratio to the actual probability, since the probability is what we typically want to estimate. So this is a short post about the derivation of the logistic regression equation: why it is transformed into the infamous ‘odds ratio’, how it is transformed back, and how it is estimated.

So, let’s begin…

When to use

Suppose we have to classify patients into two groups on the basis of blood test results, BMI and BP. The classes are diabetic and non-diabetic. This is a typical classification problem. We have a binary dependent variable such that:

y = 1 if diabetic

y = 0 if non-diabetic

The equation that we would estimate would be:

y = a + bx

Where y = 1 if diabetic and y = 0 if non-diabetic. The x variable stands for various input features such as test scores, BMI, etc. Currently we have only one x, say, test score.

What is the challenge with this?

This is a typical linear regression problem, and we can estimate it accordingly by minimizing the sum of squared errors. However, unlike regression, where we predict a number such as sales, here we are trying to predict a category that is capped by the values 1 and 0. If we get a value near 1, say .83, we can round it off to 1, and similarly .33 can be classified as a 0. Hence, effectively, this equation predicts the probability of a particular instance belonging to class 1 or class 0. Generally, some externally defined cut-off such as .5 or .75 will be used to convert the predicted score into a class. The logistic regression classification process can be visualized in the image below. Read more about classification thresholds in this post.

The Classification Process (Image source: my collection)
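As a quick illustration of the thresholding step, here is a minimal Python sketch; the scores and the .5 cut-off are made-up example values, not output from any particular model.

```python
import numpy as np

# Hypothetical predicted scores for five patients (made-up numbers)
scores = np.array([0.83, 0.33, 0.91, 0.47, 0.52])

# Externally chosen cut-off; .5 is a common default
threshold = 0.5

# Scores at or above the cut-off are labelled 1 (diabetic), otherwise 0 (non-diabetic)
predicted_class = (scores >= threshold).astype(int)
print(predicted_class)  # [1 0 1 0 1]
```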

But a key problem with this is that linear regression is unbounded. This means that, for certain independent variable values, the predicted y value can be greater than 1 or even negative. This clearly makes no sense in a classification problem.

Let us take a simple example from Wikipedia. We have student pass-fail data plotted against the number of hours of study. Pass-fail is a binary variable that can take the values 1 or 0; hours of study is numeric. We can plot this as a standard graph with a predicted line: the number of hours studied on the x axis and the probability of passing given the number of hours studied, p(y|x), on the y axis.

Plotting Linear Regression on Categorical Data (Image source: my collection)

We observe the following: as per the predicted line, if the number of hours exceeds 5, the predicted value goes beyond 1. Similarly, if the number of hours studied is less than half an hour, the predicted y, which in this case is p(y=1), becomes negative.

These results clearly make no sense and we need a way to estimate an equation that has an upper bound of 1 and a lower bound of 0, irrespective of the value of x.
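To see this numerically, here is a small sketch with made-up pass/fail data loosely in the spirit of the Wikipedia hours-of-study example (the numbers are assumptions, not the actual dataset):

```python
import numpy as np

# Made-up hours of study and pass/fail labels (1 = pass, 0 = fail)
hours  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0,   0,   0,   0,   1,   1,   1,   1,   1,   1])

# Ordinary least-squares fit: passed ~ a + b * hours
b, a = np.polyfit(hours, passed, deg=1)

# Predictions at the extremes fall outside the [0, 1] range
print(a + b * 0.1)  # negative "probability" for very little study
print(a + b * 6.0)  # "probability" greater than 1 for many hours of study
```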

The case for Logistic Regression

This brings logistic regression into the picture. Logistic regression sounds very similar to linear regression. It is essentially a transformation of the linear regression into an equation with limiting values of 0 and 1. We still want to use a linear model to solve this problem, but somehow we need to make sure that the function remains within its limits. What we wish to estimate is the following equation, which is linear:

P(y = 1) = a + bx

There are two problems with the linear equation:

  1. It is unbounded as discussed above
  2. In many situations, changing p by the same amount requires a bigger change in x when p is already large (or small) than when p is close to 1/2. This is a non-linear relationship, which a linear model cannot capture.

Now the way this problem is solved is via a logistic transformation. Normally algebra is something which adds to confusion rather than clarity. But in this case a little bit of algebra helps us see what the logistic regression equation actually means, and that it is not something that has just been conjured up.

The logistic equation is as below. Clearly this needs explaining…

P(y = 1) = 1 / (1 + e^-(a + bx))

where a + bx is the same linear expression we wanted to estimate above.

Let’s start explaining.

Deriving the Logistic Regression Equation

As a first step we need to transform p(y = 1) so that a linear function of x can represent it without being confined between 0 and 1. Going forward, and for simplicity, we denote p(y = 1) as p. The transformation of the linear equation is done by taking the odds ratio.

You will now groan and ask, ‘what is the odds ratio?’.

The odds are ratios of something happening, to something not happening (i.e. 3/2 = 1.5).

The probabilities are ratios of something happening, to everything that could happen (3/5 = 0.6).

(Zablotski, 2022)

So if the probability of an event happening is .6, the odds ratio is .6/.4 = 1.5. It tells us the odds of an event happening vs not happening. The odds ratio is defined as:

odds = p / (1 - p)

Now, this ratio has a limiting value of 0 at the lower end. At the upper end it is undefined (division by zero in the denominator when p = 1). So the odds ratio can take any value from 0 upward and is unbounded above. It has a value of 1 in the middle, corresponding to a probability of .5 for both occurrence and non-occurrence. A small range of odds, from 0 to 1, corresponds to a higher probability of failure than of success. Then there is an infinite range of odds, from 1 to infinity, which corresponds to a higher probability of success than of failure.
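The back-and-forth between probability and odds is easy to check with a couple of lines of Python; this is just an illustrative sketch:

```python
import numpy as np

def prob_to_odds(p):
    """Odds of an event with probability p (undefined at p = 1)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

for p in [0.1, 0.5, 0.6, 0.9]:
    o = prob_to_odds(p)
    # odds below 1 favour failure, odds above 1 favour success
    print(p, "->", round(o, 3), "->", odds_to_prob(o))
```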

Due to these unbalanced ranges, and to center the odds ratio around 0, we take a logarithmic transformation of the odds ratio. This makes the range symmetric around 0, i.e., it runs from -infinity to +infinity. This is shown in the plot below.

Log Odds Transformation (Image source)

This transformation, the log of the odds, is also known as the logit function and is the basis of logistic regression. The symmetry attained via this transformation improves the interpretability of the log odds, with a negative value indicating higher odds of failure and a positive value indicating higher odds of success.

Notice that the domain of the function stretches from 0 to 1, but the function is not actually defined at 0 or 1. If we substitute p = 1 into the log odds expression, it results in division by zero, yielding an undefined value. Similarly, substituting p = 0 results in evaluating the log of zero, which is also undefined (Zablotski, 2022).

We assume that this log-odds of an observation y can be expressed as a linear function of the input variable x:

log(p / (1 - p)) = a + bx
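As a concrete illustration (my own sketch, not part of the original derivation), here is the logit transform and its behaviour at the endpoints:

```python
import numpy as np

def logit(p):
    """Log odds of probability p; not defined at p = 0 or p = 1."""
    return np.log(p / (1 - p))

print(logit(0.5))                    # 0.0 (the symmetric midpoint)
print(logit(0.9), logit(0.1))        # +2.197... and -2.197..., mirror images
print(logit(np.array([0.0, 1.0])))   # -inf and +inf (NumPy warns about the division and log)
```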

Going from odds to probability

Now we have the log odds equation, and we are fairly clear that it has to be estimated in some way. However, the question you might ask now is: ‘but the dependent variable is the log odds, whereas I need a simple probability, p. How do I get back to it without very complicated calculations?’

From log odds to p (image source: my collection)

Don’t worry. This is the reason the Logistic transformation was used. We shall see with a few steps of algebra that we can get back to our good old understandable probability, p.

Step 1: We first exponentiate both sides:

p / (1 - p) = e^(a + bx)

Step 2: Some simple algebraic manipulations (putting in detailed steps for the algebraically challenged). Multiplying both sides by (1 - p):

p = (1 - p) e^(a + bx)

p = e^(a + bx) - p e^(a + bx)

Consolidating the p terms on the LHS:

p + p e^(a + bx) = e^(a + bx)

p (1 + e^(a + bx)) = e^(a + bx)

p = e^(a + bx) / (1 + e^(a + bx))

Finally, we can divide both the numerator and the denominator by e^(a + bx) to get:

p = 1 / (1 + e^-(a + bx))

This is the functional form of the logistic regression equation that we are familiar with, which is also called the sigmoid.

This is our standard logistic regression equation: it transforms a linear combination of the inputs to give the probability of the positive class in terms of the various independent variables.

We can plot the logistic regression equation, and it gives an S-shaped curve with .5 as the midpoint and probabilities of 0 and 1 as limiting values.

Logistic regression plot (image source: my collection)

In the plot above I have plotted the probability of an event y given x. This is termed the sigmoid function. The x values are shown normalized. As you can see, the probability values are bounded by 0 at the lower end of the graph and 1 at the upper end, with the midpoint at a probability of .5.
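To make the S-curve concrete, here is a short sketch that also checks that the two algebraic forms derived above agree:

```python
import numpy as np

def sigmoid(z):
    """Standard sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 201)

# The two forms from the derivation are the same function
form_1 = np.exp(z) / (1 + np.exp(z))
form_2 = sigmoid(z)
print(np.allclose(form_1, form_2))   # True

print(sigmoid(0))                    # 0.5, the midpoint
print(sigmoid(-10), sigmoid(10))     # ~0.000045 and ~0.999955, never outside [0, 1]
```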

How is this equation estimated?

Now that we understand the curve and the sigmoid (the difficult part), we can briefly look at how the logistic model is estimated. As we have seen, the logistic model is a non-linear transformation of the linear regression model. The linear model is usually fitted by minimizing the sum of squared errors via optimization. However, logistic regression cannot be fitted with the standard closed-form techniques such as the normal equations, so we use maximum likelihood estimation (MLE). MLE is also a method based on the data: it tries to solve the problem of, given the data we have observed, which model parameters maximize the likelihood of observing exactly that data.
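To make the idea concrete, here is a minimal sketch of MLE for a one-feature logistic regression, minimizing the negative log-likelihood with scipy's general-purpose optimizer; the data are made up (with some overlap between passers and failers so the estimates stay finite):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up hours-of-study data with overlapping classes (1 = pass, 0 = fail)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0,   0,   0,   1,   0,   1,   1,   0,   1,   1])

def neg_log_likelihood(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))   # sigmoid of the linear predictor
    eps = 1e-9                               # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# MLE: find (a, b) that make the observed labels most likely
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
a_hat, b_hat = result.x

print(a_hat, b_hat)
print(1.0 / (1.0 + np.exp(-(a_hat + b_hat * 3.0))))  # estimated p(pass | 3 hours of study)
```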

A simple example to illustrate the basic idea:

Suppose we have a box with blue and green balls. We do not, as of now, know the distribution of the colors. However, we pull out 10 random samples with replacement and get 8 blue and 2 green balls. The question that MLE asks is: what distribution of blue and green balls in the box would, in repeated rounds of 10 draws, most likely give a result such as 8 blue and 2 green? We can try various underlying distributions to test which one would give us a result of 8 blue and 2 green. For example, a 50:50 distribution of the two colors will most likely give a very low probability of an 8-blue, 2-green result. The probabilities are likely to be higher the closer the underlying distribution is to the drawn one: for example, 7 blue and 3 green, 9 blue and 1 green, etc.
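This intuition is easy to verify with the binomial distribution; a quick sketch using scipy:

```python
from scipy.stats import binom

# Likelihood of drawing exactly 8 blue balls in 10 draws (with replacement),
# for several candidate proportions of blue balls in the box
for p_blue in [0.5, 0.6, 0.7, 0.8, 0.9]:
    likelihood = binom.pmf(8, 10, p_blue)
    print(f"p(blue) = {p_blue}: P(8 blue out of 10) = {likelihood:.3f}")

# The likelihood peaks at p(blue) = 0.8, the sample proportion,
# which is exactly the value maximum likelihood estimation picks.
```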

Effectively, this is what maximum likelihood estimation does for the logistic regression equation. Delving into the details of MLE requires a separate post, so I will save that for another time. However, this article should have provided you with an intuitive understanding of the theoretical foundations of logistic regression. Theoretical foundations are useful in machine learning when we have to explain the rationale for what we have done and how one algorithm differs from another. In short, the ‘intelligent’ part of machine learning.

If you liked this article, please follow up with my article on Classification Evaluation.

Thanks for reading and let me know your comments below!

References

  1. https://en.wikipedia.org/wiki/Logistic_regression
  2. https://shaileydash.medium.com/understanding-the-roc-and-auc-intuitively-31ca96445c02
  3. https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
  4. https://data-se.netlify.app/2020/11/28/derivation-of-the-logistic-regression/
  5. https://yury-zablotski.netlify.app/post/how-logistic-regression-works/
  6. https://yury-zablotski.netlify.app/post/from-odds-to-probability/
  7. https://towardsdatascience.com/understanding-maximum-likelihood-estimation-mle-7e184d3444bd
  8. https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1

