Logistic Regression

Theory and intuition behind logistic regression, and how to implement it in Python

Aayush Bajaj
Towards Data Science


This is part of a series of blogs in which I demonstrate different aspects and the theory of Machine Learning algorithms using math and code. Each post covers the modeling structure of the algorithm and the intuition for why and how it works, accompanied by Python code.

By the end of this blog you’ll know:

  • How Logistic Regression works mathematically and how to code it.
  • Why Logistic Regression is a linear classifier.
  • How to evaluate the model you made.

Introduction

Logistic Regression is one of the first algorithms introduced when someone learns about classification. You have probably read about Regression and the continuous nature of the predicted variable. Classification, on the other hand, is done on discrete variables, which means your predictions are finite and class-based, like Yes/No or True/False for binary outcomes. However, simply guessing “Yes” or “No” is pretty crude. Something that takes noise into account, and doesn’t just give a binary answer, will often be more useful.

In short, we want probabilities, which means we need to fit a stochastic model. What would be nice, in fact, would be to have the conditional distribution of the response Y given the input variables, P(Y|X). So, if our model says that there’s a 51% chance of rain and it doesn’t rain, that’s better than if it had said there was a 99% chance of rain (though even a 99% chance is not a sure thing). This is why it’s called Logistic Regression rather than classification: it predicts probabilities, which are continuous (but bounded).

Pretty neat, right? But you must be thinking that even if the outcomes are finite, linear regression might handle it. There’s a comprehensive answer to this here. Intuitively, whenever you fit a regression line to discrete data and then introduce an outlier, the line will try to fit the outlier, and in doing so it forces you to lower your hypothesis threshold, otherwise your predictions will go wrong.

Let’s now get to the fun part…

Modeling

We have a binary output variable Y, and we want to model the conditional probability P(Y = 1|X = x) as a function of x. Logistic Regression belongs to the Generalized Linear Models (GLM) family. So the question arises: how can we use linear regression to solve this?

  1. The idea is to let P(x) be a linear function of x. Every change in x would affect the probability. The conceptual problem here is that P must be between 0 and 1, and linear functions are unbounded. Moreover, in many situations we may see “diminishing returns”: changing P by the same amount requires a bigger change in x when P is already large (or small) than when P is close to 1/2. Linear models can’t do this.
  2. The next best idea is to let log(P(x)) be a linear function of x so that changing an input variable multiplies the probability by a fixed amount.
Fig: 1 - log(x) curve

As you can see above, logarithms are bounded in only one direction. This means that a change in x in the positive direction may not significantly impact the result, whereas an equal change in the negative direction will.

  3. Finally, the best modification of log(P(x)), which keeps the probability bounded on both sides, is the logistic (or logit) transformation, log(P(x)/(1−P(x))). This is also the log(odds) of an event, i.e. the log of the ratio of success to failure. Solving this transformation for P(x) gives the squiggly sigmoid curve, which will be our hypothesis for this model. The bounded nature can be seen in the graph below:

Fig: 2

So, in Linear Regression or OLS our hypothesis is:

Eq: 1.1    y = β0 + β1·x1 + β2·x2

and if we equate this newly found transformation (which bounds the output so that we predict probabilities instead of a continuous outcome) with the linear hypothesis above, we get:

Eq: 1.2    log( P(x) / (1 − P(x)) ) = β0 + β1·x1 + β2·x2

Solving for P(x), we get:

Eq: 1.3    P(x) = 1 / (1 + e^−(β0 + β1·x1 + β2·x2))

Now, in order to identify the class, we can assume a threshold value (say 0.5) and assign the class accordingly:

Eq: 1.4    ŷ = 1 if P(x) ≥ 0.5, otherwise ŷ = 0

Now let’s get some coding done, starting with visualizing the Probability space and the predictor space.

Importing Modules

For the purpose of demonstrating this algorithm, we’ll be using the Iris dataset, a popular starter classification dataset. Let’s import it and use only 2 of the 3 classes present.

To keep the data at a maximum of 3 dimensions, so that every step can be visualized, we’ll only consider sepal length, sepal width, and of course the label.
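The script itself isn’t reproduced here, but a minimal sketch of the idea could look like this, assuming scikit-learn and matplotlib are installed (the variable names X and y are mine and are reused in the later snippets):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load Iris, keep only classes 0 and 1, and only the two sepal features.
iris = load_iris()
mask = iris.target < 2
X = iris.data[mask][:, :2]    # sepal length, sepal width
y = iris.target[mask]         # labels: 0 or 1

# "Probability space": the class label (0/1) plotted against the two features.
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(X[:, 0], X[:, 1], y, c=y)
ax.set_xlabel("sepal length")
ax.set_ylabel("sepal width")
ax.set_zlabel("P(class = 1)")
plt.show()
```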

After running the above script you’ll see something like this:

Fig:3

This shows your probability space for the features belonging to class 0 and 1.

Decision Boundary

Eq: 1.3 and 1.4 mean guessing 1 whenever β0 + β1·x1 + β2·x2 > 0, and 0 otherwise. So logistic regression gives us a linear classifier. The decision boundary separating the two predicted classes is the solution of:

Eq: 1.5    β0 + β1·x1 + β2·x2 = 0

which is a point in 1 dimension, a line in 2-D (this case), and so on. The distance of a point x from the decision boundary can be calculated too:

Eq: 1.6    D(x) = (β0 + β1·x1 + β2·x2) / ‖(β1, β2)‖

Setting Eq: 1.6 to zero also gives back the equation of the decision boundary. Let’s now see the predictor space.
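The plotting script isn’t shown here either; a minimal sketch, reusing X and y from the loading step above, might be:

```python
# "Predictor space": the two features plotted against each other, coloured by class.
plt.scatter(X[y == 0, 0], X[y == 0, 1], label="class 0")
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="class 1")
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.legend()
plt.show()
```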

After running the above script you’ll see something like this:

Fig:4

As you can probably see, there exists a boundary that can divide this space into two parts belonging to classes 0 and 1. Logistic Regression will make use of the probability space as well as the predictor space (above) to build a linear decision boundary between classes 0 and 1.

Now we’ve finished the modeling part. The parameters we want to optimize are β0,β1,β2. To do this we’ve got a pretty neat technique up our sleeves.

Maximum Likelihood Estimation

This is the go-to strategy for finding the parameters that are most likely to have generated the observed data. It is the same strategy that many other statistical approaches use to estimate parameters. The difference from linear regression is that we cannot use the same (Residual)² approach here; we’ll see why. To approach this:

  • Visualize all the points in log(odds) space. This means you have to consider Eq: 1.2. Initialize β0, β1, β2 to some random values and make a candidate fit line (a plane for 2-D features) in the log(odds) space.
Fig:5
  • Now project the data-points onto the line and calculate the Likelihood of all the points.

Note that here the probability is not calculated as an area under the curve (as it would be in the probability space) but is read directly off the axis, so it is the same as the likelihood.

The total likelihood of the candidate line will be the product of all the individual likelihoods as shown in the equation below.

Note: while calculating the likelihood of a class-0 point, the likelihood is calculated as (1 − P(x)).

Fig:6
Eq: 1.7    Likelihood = ∏(i: yi = 1) P(xi) × ∏(i: yi = 0) (1 − P(xi))

Now our objective is to maximize this Likelihood function w.r.t. the parameters. To achieve this we need to differentiate it, but the problem is that differentiating a long product of terms with the product rule quickly becomes difficult to handle. So we’ll convert this equation to the log-likelihood and solve that.
After log-transformation and rearranging variables, you’ll see something like this:

Eq: 1.8    LL = Σi [ yi·log(P(xi)) + (1 − yi)·log(1 − P(xi)) ]

Now by substituting the values using Eq:1.2,1.3 and rearranging variables, you’ll see:

Eq: 1.9    LL = Σi [ yi·(β0 + β1·xi1 + β2·xi2) − log(1 + e^(β0 + β1·xi1 + β2·xi2)) ]

The above equation is far easier to differentiate than Eq: 1.7. Now we have to differentiate it w.r.t. β0, β1, β2 to get the optimal values, so we’ll write the derivative in a general form that covers all three:

Eq: 1.10    ∂LL/∂βj = Σi (yi − P(xi))·xij    (with xi0 = 1)

Note that the above equation is a transcendental equation, which doesn’t have a closed-form solution. So we can’t solve for the parameters analytically; instead, we’ll solve this numerically.

You must be thinking: if we were going to solve this numerically anyway, what was the point of taking the pain to understand all this? We have to understand that MLE is the basis of many estimation procedures and is widely used. It’s simple, it’s effective, and we’ll still be using the log-likelihood in our numerical method.

Numerous numerical methods exist to solve this, such as Newton’s method for numerical optimization; for this blog, we’re going to use our good old Gradient Descent. Continuing the code:
First, define our cost function, which is nothing but the negative of our log-likelihood function (so that minimizing the cost maximizes the likelihood).
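A minimal sketch of such a cost function, written as the negative of the averaged log-likelihood of Eq: 1.8/1.9 (the function names are mine, not the blog’s original gist):

```python
import numpy as np

def sigmoid(z):
    # Eq: 1.3 -- squashes the linear combination into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Negative (averaged) log-likelihood of Eq: 1.8, so that
    # minimizing the cost maximizes the likelihood.
    p = sigmoid(X @ theta)
    eps = 1e-9                      # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```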

Now define the Gradient Descent Function.
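A minimal sketch, assuming the feature matrix carries a leading column of ones so that β0 is learned together with β1 and β2 (the learning rate and iteration count are illustrative, not the blog’s original values):

```python
def gradient_descent(X, y, lr=0.1, n_iter=5000):
    # theta = [beta0, beta1, beta2]; X is expected to carry a leading column of ones.
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / len(y)   # gradient of the averaged negative log-likelihood (cf. Eq: 1.10)
        theta -= lr * grad
        costs.append(cost(theta, X, y))
    return theta, costs
```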

After this, we’ll do some matrix operations to prepare the input for the model.
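For example, something along these lines, reusing X and y from the earlier snippets:

```python
# Prepend a column of ones so that beta0 is handled like any other coefficient,
# and make sure the labels are floats.
X_model = np.hstack([np.ones((X.shape[0], 1)), X])
y_model = y.astype(float)
```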

and then the training part…
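A sketch of the training call, using the helpers defined above:

```python
theta, costs = gradient_descent(X_model, y_model, lr=0.1, n_iter=5000)
print("Optimized parameters (beta0, beta1, beta2):", theta)
```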

In the output of the above code, you’ll see the optimized parameters from the model.
Let’s now see how the model converges based on data and other hyper-parameters.
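A minimal plotting sketch, reusing the costs list returned by the gradient descent sketch above:

```python
plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("cost (negative log-likelihood)")
plt.show()
```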

After running the above script, we can see the cost of the model decreasing non-linearly as the iterations progress.

Fig:7

The decision boundary for the Logistic Regression model we just built can be visualized using the code below.

Note that the slope and intercept of the line can be calculated using Eq: 1.6.
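A sketch of such a plot, reusing X, y, and theta from the earlier snippets; the boundary line comes from setting β0 + β1·x1 + β2·x2 = 0 and solving for x2:

```python
# On the boundary: beta0 + beta1*x1 + beta2*x2 = 0  =>  x2 = -(beta0 + beta1*x1) / beta2,
# i.e. slope = -beta1/beta2 and intercept = -beta0/beta2.
b0, b1, b2 = theta
x1_line = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2_line = -(b0 + b1 * x1_line) / b2

plt.scatter(X[y == 0, 0], X[y == 0, 1], label="class 0")
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="class 1")
plt.plot(x1_line, x2_line, "k--", label="decision boundary")
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.legend()
plt.show()
```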

Running the above script will plot the following:

Fig:8

Voila! We just made a linear classifier on the Iris dataset using Logistic Regression.
Now we have the best-fit decision boundary and squiggle (sigmoid) for this data, but how do we know whether it is any use? How do we evaluate something like this?

R² coefficient of determination and p-value

In the case of Generalized Linear Models such as Linear Regression/OLS, we achieve this by calculating the R² coefficient of determination and its p-value for significance. If you recall, the R² coefficient of determination in Linear Regression is calculated using (Residuals)², but in the case of classification using Logistic Regression this method doesn’t make sense: in the log(odds) dimension the data points are pushed to +∞ and −∞, so residuals are meaningless.

The solution to this is McFadden’s Pseudo R². This method is very similar to the OLS R², so it’s super easy to understand.

Let’s first quickly recap the OLS R². It is a comparative coefficient telling you how much of the total variation in Y (the target) is explained by the fitted line.

where SE stands for Squared Error:

Eq: 1.11    R² = (SE(mean) − SE(line)) / SE(mean)

In other words, it is just a comparison between the worst fit (SE(mean)) and the best fit (SE(line)).

Now let’s talk about R² in terms of Logistic Regression. Just like linear regression, we first need a best fit to compare against a bad fit. The log-likelihood of your best-fit line in the log(odds) space (see Fig: 6) will be LL(fit), filling in for SE(line) in Eq: 1.11.

The remaining mystery is how to calculate the bad-fit line. Don’t worry, it’s quite intuitive too. In Linear Regression modeling our worst fit was y = mean(y). In this case, we’ll do quite a similar thing.

Fig: 9
Eq: 1.12    log(odds) = log( number of class-1 samples / number of class-0 samples )
  • First, calculate the worst-fit line by ignoring the features and simply taking the log of the ratio of class counts, as stated in Eq: 1.12, and transform it into the probability space.
  • Now, calculate the sum of all log-likelihoods of the points of that worst fit line. This will give you LL(overall probability).
  • In fact, transforming this worst-fit log(odds) back into probability space gives exactly P = (number of samples with class = 1) / (total number of samples).

Now we have LL(overall probability), a measure of the bad fit, and LL(fit), a measure of the best fit. So your R² will be:

Eq: 1.13    R² = (LL(overall probability) − LL(fit)) / LL(overall probability)

R² will be in the range [0,1], 0 representing the worst fit, and 1 representing the best fit.
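As a rough illustration, McFadden’s pseudo R² can be computed from the two log-likelihoods discussed above; this sketch reuses costs and y_model from the earlier snippets:

```python
# LL(fit): log-likelihood of the fitted model (the final cost was the averaged
# negative log-likelihood, so undo both the sign and the averaging).
ll_fit = -costs[-1] * len(y_model)

# LL(overall probability): log-likelihood of the intercept-only "worst fit",
# whose constant prediction is the class-1 proportion.
p_overall = y_model.mean()
ll_overall = np.sum(y_model * np.log(p_overall) + (1 - y_model) * np.log(1 - p_overall))

pseudo_r2 = (ll_overall - ll_fit) / ll_overall    # Eq: 1.13, McFadden's pseudo R-squared
print("McFadden's pseudo R^2:", pseudo_r2)
```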

P-value

Now, we’ll calculate the significance of this R² using the p-value. Calculating the p-value is pretty straightforward.

Eq: 1.14    2·(LL(fit) − LL(overall probability)) ~ χ²(degrees of freedom)

The degrees of freedom in this case are 3 (the fitted model needs information along 3 axes) − 1 (the worst fit only needs the intercept on the log(odds) axis) = 2. Now, after calculating the LHS of the above equation, you can look up the p-value in the table below:
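Instead of the table, the p-value can also be looked up programmatically; a small sketch using scipy, reusing ll_fit and ll_overall from the previous snippet:

```python
from scipy.stats import chi2

# Eq: 1.14 -- twice the log-likelihood gain of the fit over the worst fit,
# compared against a chi-squared distribution with 2 degrees of freedom.
test_statistic = 2 * (ll_fit - ll_overall)
p_value = chi2.sf(test_statistic, 2)   # survival function = 1 - CDF
print("p-value:", p_value)
```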

Fig: 10

Coding each and every part of the evaluation by hand can be a hectic process; thankfully, statsmodels takes care of it. It will display not only the R² and p-value but also a host of other information that can help assess the model better.

Start by importing the statsmodels module.

Since we are talking about a linear separation between the features, we define the relationship with the label as linear.
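A minimal sketch of what that script might look like, using the statsmodels formula API and reusing X and y_model from the earlier snippets (note that, depending on the statsmodels version, perfectly separable data may trigger a perfect-separation warning or error):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assemble the two features and the label into a DataFrame for the formula API.
df = pd.DataFrame(X, columns=["sepal_length", "sepal_width"])
df["label"] = y_model

# A linear relationship between the log(odds) of the label and the features (Eq: 1.2).
result = smf.logit("label ~ sepal_length + sepal_width", data=df).fit()
print(result.summary())   # includes the pseudo R-squared and LLR p-value, among other stats
```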

Running the above script will fetch you the following result:

We can observe here that our p-value is 1.00, since our data was linearly separable; also observe that our log-likelihood is small, consistent with a high R² value.

A few points and assumptions of Logistic Regression:

  • We only talked about binary classification in this blog, but you can also apply logistic regression to multi-class problems. Say there are k classes: instead of one set of parameters (β0, β), each class c in 0 : (k − 1) will have its own offset β0_c and vector β_c, and the predicted conditional probabilities will be P(Y = c | X = x) = e^(β0_c + x·β_c) / Σ_c′ e^(β0_c′ + x·β_c′).
  • Logistic Regression assumes the decision boundary to be linear. So if you know beforehand that your data has a non-linear decision boundary, a different algorithm may prove better than this one.
  • Logistic Regression doesn’t spit out classes but probabilities.
  • Multicollinearity affects every such algorithm because it distorts tests of statistical significance. So, try identifying and addressing it before running this algorithm.

You’ve reached the end!

Congratulations for staying with me till the end and understanding one of the most important algorithms in Machine Learning, one that also lays the foundation for many others.

Thanks for reading. Have thoughts or feedback? Comment below!
