Logistic Regression is one of the most popular classification techniques. Most tutorials and articles explain its probabilistic interpretation, so in this article I will try to give the geometric intuition of Logistic Regression instead. The topics I will cover are –
- Geometric Intuition of Logistic Regression
- Optimisation Function
- Sigmoid Function
- Overfitting and Underfitting
- Regularisation – L2 and L1
The Intuition

From the above image, we can think of Logistic Regression as the process of finding the plane that best separates our classes; note that Logistic Regression assumes the classes are linearly separable.
So what we need now is a classifier that can separate the two classes. From Figure 1 we can observe that W^T * Xi > 0
represents the positive class, as the positive class points lie in the direction of W (the normal to the plane),
and W^T * Xi < 0
represents the negative class.
So our classifier is –
If W^T * Xi > 0 : then Y = +1 where Y is the class label
If W^T * Xi < 0 : then Y = -1 where Y is the class label
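To make this rule concrete, here is a minimal Python sketch (the weight vector and data points below are made up purely for illustration):

```python
import numpy as np

# Hypothetical weight vector (normal to the separating plane) and data points.
W = np.array([1.0, 2.0])
X = np.array([[3.0, 1.0],     # W^T * x = 5  -> positive class
              [-2.0, -1.5]])  # W^T * x = -5 -> negative class

# Classifier: the sign of W^T * x decides the class label.
predictions = np.where(X @ W > 0, +1, -1)
print(predictions)  # [ 1 -1]
```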
In the above paragraph, I mentioned that the objective of Logistic Regression is to find the plane that best separates the two classes, and if you are an inquisitive person then you must be wondering how we will determine the best separation. So let’s try to understand this.
Finding the right plane
In order to measure anything we need a value, and in this case we will get that value by defining an optimisation function; the result of that function will be used to determine which plane is the best. That is about as vague and abstract as it can get, so let me explain it properly using a few cases and their corresponding examples. Hold tight !!
Case 1 - [Yi = +1] , [W^T * Xi > 0]
Yi = +1 means that the correct class label is positive.
Yi * W^T * Xi > 0 means that we have correctly predicted the class label.
Example – W^T * Xi = 5 (5 > 0) and Yi = +1.
Here, Yi W^T Xi = 5
Case 2 - [Yi = -1] , [W^T * Xi < 0]
Yi = -1 means that the correct class label is negative.
Yi * W^T * Xi > 0 means that we have correctly predicted the class label.
Example – W^T * Xi = -5 (-5 < 0) and Yi = -1.
Here, Yi W^T Xi = (-1)(-5) = 5
Case 3 - [Yi = +1] , [W^T * Xi < 0]
Yi = +1 means that the correct class label is positive.
Yi * W^T * Xi < 0 means that we have incorrectly predicted the class label.
Example – W^T * Xi = -5 (-5 < 0) and Yi = +1.
Here, Yi W^T Xi = (1)(-5) = -5
Case 4 - [Yi = -1] , [W^T * Xi > 0]
Yi = -1 means that the correct class label is negative.
Yi * W^T * Xi < 0 means that we have incorrectly predicted the class label.
Example – W^T * Xi = 5 (5 > 0) and Yi = -1.
Here, Yi W^T Xi = (-1)(5) = -5
If you look at these cases carefully, you will observe that Yi * W^T * Xi > 0
means that we have correctly classified the point, and Yi * W^T * Xi < 0
means that we have incorrectly classified it.
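As a quick sanity check, the four cases above can be verified with a few lines of Python (using the same illustrative numbers as in the examples):

```python
# (W^T * Xi, Yi) pairs for the four cases above.
cases = [(5, +1), (-5, -1), (-5, +1), (5, -1)]

for wtx, y in cases:
    product = y * wtx
    status = "correct" if product > 0 else "incorrect"
    print(f"W^T*Xi = {wtx:>2}, Yi = {y:+d} -> Yi*W^T*Xi = {product:>2} ({status})")
```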
Looks like we have found our much-awaited optimisation function –
W* = argmax( ∑i=1 to n Yi * W^T * Xi ), where n is the number of data points.
So the plane with the maximum value of this function will act as the decision surface (the plane that best separates our points).
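As a rough sketch of what "maximum value of this function" means in practice, assume we already have a couple of candidate planes (the toy data and candidate weight vectors below are made up); we score each one with the sum and keep the best:

```python
import numpy as np

# Hypothetical toy data: X holds the points, y holds the labels (+1 / -1).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])

# Two hypothetical candidate planes (weight vectors).
candidates = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]

# Score each candidate with sum_i Yi * W^T * Xi and pick the maximum.
scores = [np.sum(y * (X @ W)) for W in candidates]
best = candidates[int(np.argmax(scores))]
print(scores, best)
```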
Analysing the optimisation function
Before you start celebrating the fact that we have got our optimisation function, let’s analyse it and make sure it works properly regardless of the dataset.

As you might have guessed by now, our optimisation function is not robust to outliers. Intuitively, if you look at the above figure you will realise that π1 is a better plane than π2, as π1 correctly classifies 14 data points while π2 correctly classifies only a single data point, but according to our optimisation function π2 is better.
There are various methods to remove outliers from a dataset, but no method can remove 100% of them, and as we have seen above, even a single outlier can heavily impact our search for the best plane.
So how can we handle this problem of outliers? Enter Sigmoid Function.
Sigmoid Function
The basic idea behind the Sigmoid function is squishing. Squishing can be explained as follows.
If signed distance is small : then use it as is
If signed distance is large : then squish it to a smaller value
The Sigmoid function squishes large values, and all of its output values lie between 0 and 1.
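Here is a small sketch of this squishing behaviour (the signed-distance values below are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # Squishes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Small signed distances stay informative; huge ones get squished towards 0 or 1.
for d in [-100, -5, -1, 0, 1, 5, 100]:
    print(d, round(float(sigmoid(d)), 4))
```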
Now you must be wondering: there are various other functions that can do the same job of limiting the values to a certain range, so what is so special about the sigmoid function?
Why Sigmoid ??
There are various reasons to choose the sigmoid function over others –
- It provides a nice probabilistic interpretation. For example, if a point lies on the decision surface (d = 0) then intuitively its probability should be 1/2, as it could belong to either class, and indeed σ(0) = 1/2.
- It is easy to differentiate.
If you are still not convinced then you can check this link to know more about sigmoid function.
So our new optimisation function is –
W* = argmax( ∑i=1 to n σ(Yi * W^T * Xi) ), where σ(z) = 1 / (1 + exp(-z))
We can further modify it by taking the log of this function to simplify the math. Since log is a monotonically increasing function, this won’t affect our model. If you are not aware of what a monotonically increasing function is, here is a brief overview –
A function g(x) is called monotonically increasing if g(x) increases whenever x increases, i.e. if x1 > x2 then g(x1) > g(x2).
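A tiny numeric illustration of this property (the values are arbitrary):

```python
import numpy as np

# Made-up positive values with x1 > x2.
x1, x2 = 7.0, 3.0

# A monotonically increasing function (here log) preserves the ordering.
print(x1 > x2)                  # True
print(np.log(x1) > np.log(x2))  # True
```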
Transforming the Optimisation Function
There are still a few transformations left before we get to the final version of our optimisation function.
- Taking the log to simplify the mathematics involved in optimising this function.
W* = argmax( ∑i=1 to n log( 1 / (1 + exp(-Yi * W^T * Xi)) ) )
- Transformation using the log property log(1/x) = -log(x).
W* = argmax( -∑i=1 to n log(1 + exp(-Yi * W^T * Xi)) )
- Using the property argmax(-f(x)) = argmin(f(x)).
W* = argmin( ∑i=1 to n log(1 + exp(-Yi * W^T * Xi)) )
Strategy for minimisation
W* = argmin( ∑i=1 to n log(1 + exp(-Yi * W^T * Xi)) )    // Optimisation Function
Let Zi = Yi * W^T * Xi
W* = argmin( ∑i=1 to n log(1 + exp(-Zi)) )
exp(-Zi) is always positive, and it gets closer to 0 as Zi grows. Since we are looking to minimise our optimisation function, we want exp(-Zi) to be as close to 0 as possible.
∑i=1 to n log(1 + exp(-Zi)) >= 0
The minimum possible value of our optimisation function is 0, which is approached as exp(-Zi) goes to 0, since log(1 + 0) = 0.
So the overall minimum value for our optimisation function will occur when
Zi -> +∞ for all i
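A quick numeric check of how the per-point term log(1 + exp(-Zi)) behaves as Zi grows (the Zi values below are arbitrary):

```python
import numpy as np

# The per-point loss shrinks towards 0 as Zi -> +infinity.
for z in [-5, 0, 1, 5, 20, 100]:
    loss = np.log1p(np.exp(-z))  # log(1 + exp(-z)); log1p is numerically safer
    print(z, float(loss))
```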
Let’s take a closer look at the term Zi.
Zi = Yi * W^T * Xi
Since it is a supervised learning algorithm therefore we are given the values of X and Y.
X – Features on the basis of which we predict the correct class label
Y – The correct class label
So we can’t change Xi or Yi; the only term left to manipulate is "W". You can already get a sense that only by picking a really large value of W can we push Zi towards infinity.
In order to push the value of Zi towards infinity, we will pick a very large value for W, so that |W^T * Xi| becomes huge.
Case 1 - [Yi = +1, W^T * Xi > 0]
Yi * W^T * Xi = (+1) * (very large +ve value of W^T * Xi) = very large +ve value
Case 2 - [Yi = -1, W^T * Xi < 0]
Yi * W^T * Xi = (-1) * (very large -ve value of W^T * Xi) = very large +ve value
So, as you can see, if we pick a large enough value for W then we can accomplish our goal of making Zi -> +∞ for every correctly classified point.
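Here is a small sketch of this strategy on made-up, linearly separable toy data: take a W that already classifies every point correctly, scale it by larger and larger constants, and watch the total loss collapse towards 0:

```python
import numpy as np

# Toy linearly separable data.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])

W = np.array([1.0, 1.0])  # correctly classifies every point

# Scaling W up drives Zi = Yi * W^T * Xi towards +inf and the loss towards 0.
for c in [1, 10, 100]:
    Z = y * (X @ (c * W))
    loss = np.sum(np.log1p(np.exp(-Z)))
    print(c, float(loss))
```

The printed loss shrinks rapidly as the scaling constant grows, which is exactly the problem discussed next.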
Problem with this strategy
By using the above strategy everything looks fine, as our goal was to make
Zi -> +∞
and thereby make log(1 + exp(-Zi)) -> 0,
and with this strategy we can successfully do that.
The only problem with this strategy is that we can always minimise our optimisation function for all values of 'i' simply by scaling W up. Sounds a bit ironic, as our goal was to minimise the function for all values of i, and all of a sudden that became a problem. If you are frustrated then it’s a really good sign, and it means that you have understood every detail so far. So let’s dig deeper into this problem.
The main problem here is that we are overfitting our model. If you are not familiar with the term overfitting then here is a brief overview –
Overfitting means that our model works very well on the training data, because the weights have been adjusted to fit that data exactly, and as a result it does a really bad job on the test data.
That is not a technically correct definition and I just wanted to give you an intuition of overfitting.
Overfitting

Here the red dots represent negative data points and green dots represent positive data points.
As you can see, in the case of overfitting our decision surface perfectly classifies each and every point, so we will get 100% accuracy on our training data. But consider this scenario –

Here the blue point is our test data point, and we want to predict whether it belongs to the positive or the negative class. According to our decision surface it is a negative class point, but it is most probably a positive class point, since it is closer to the positive class points than to the negative class points. This is called overfitting.
This is exactly how our model will behave if we follow the above strategy of always choosing a large value of W and making Zi -> +∞.
Regularisation to the rescue
Now that you have FINALLY understood what the actual problem is, we can move towards finding a solution and that solution is Regularisation.
A lot of you might have a vague idea about this, and you must have heard that it is used to prevent overfitting and underfitting, but very few people actually know how regularisation prevents them. So BRACE YOURSELF, you are about to join that elite group.
There are two major types of regularisation –
- L2 regularisation
- L1 regularisation
L2 regularisation –
In L2 regularisation we introduce an additional term called the regularisation term in order to prevent overfitting.

W* = argmin( ∑i=1 to n log(1 + exp(-Yi * W^T * Xi)) + λ * W^T * W )
Loss term: ∑i=1 to n log(1 + exp(-Yi * W^T * Xi))
Regularisation term: λ * W^T * W
Here ‘λ’ is a hyperparameter that plays an important role in our classification model, but first let’s focus on the effect of the regularisation term.
Remember that our goal was to make Zi -> +∞,
and since Xi and Yi are fixed, we could only tweak the value of W; here you can see that we are multiplying W^T * W by λ and adding it to the loss.
So earlier we were free to blow up the value of W to push W^T * Xi towards +∞ or -∞,
but now if we try to do that, then although the value of our loss term will move towards 0, the value of our regularisation term will become very large. So there is essentially a trade-off between the loss term and the regularisation term.
The regularisation term essentially penalises our model for choosing very large values of W, hence avoiding overfitting.
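Reusing the same toy data and the scaling trick, here is a sketch of the L2-regularised objective (λ = 0.1 is an arbitrary choice) showing how the regularisation term now punishes large values of W:

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
lam = 0.1  # arbitrary regularisation strength

def l2_objective(W):
    loss = np.sum(np.log1p(np.exp(-y * (X @ W))))  # loss term
    reg = lam * (W @ W)                            # λ * W^T * W
    return loss + reg

W = np.array([1.0, 1.0])
for c in [1, 10, 100]:
    print(c, float(l2_objective(c * W)))
```

Scaling W up no longer helps: the small drop in the loss term is swamped by the growth of λ * W^T * W.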
Role of λ
λ plays a key role in optimising our function.
- If we significantly decrease the value of λ then the model overfits as the effect of the regularisation term becomes negligible.
- If we significantly increase the value of λ then our model underfits as the loss term becomes negligible and the regularisation term doesn’t contain any training data.
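If you use scikit-learn, this trade-off is exposed through the C parameter of LogisticRegression, which is the inverse of the regularisation strength (roughly C ≈ 1/λ): a very large C barely regularises and risks overfitting, while a very small C regularises heavily and risks underfitting. A minimal sketch on a synthetic dataset (all settings below are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for C in [100.0, 1.0, 0.01]:  # large C ~ small λ, small C ~ large λ
    clf = LogisticRegression(C=C).fit(X, y)
    print(C, float(np.abs(clf.coef_).max()))  # weights shrink as λ grows
```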
L1 regularisation
The purpose of L1 regularisation is the same as that of L2, i.e. avoiding overfitting.
W* = argmin( ∑i=1 to n log(1 + exp(-Yi * W^T * Xi)) + λ * ||W|| )
Loss term: ∑i=1 to n log(1 + exp(-Yi * W^T * Xi))
Regularisation term: λ * ||W||
Here ||W|| = ∑j=1 to d |Wj|, where d is the number of features (the dimensionality of W) and |Wj| is the absolute value of the j-th weight.
The main difference between L1 and L2 Regularisation is that L1 Regularisation creates sparse vectors.
F = <f1, f2, f3, ..., fi, ..., fn>
W = <W1, W2, W3, ..., Wi, ..., Wn>
Here, if we have a feature fi which is unimportant or less important, then the weight Wi corresponding to it will be 0 if we use L1 regularisation, whereas with L2 regularisation it will be a small value but not necessarily 0.
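Here is a hedged sketch of this sparsity effect using scikit-learn (the dataset and hyperparameters are arbitrary; liblinear is one of the solvers that supports the l1 penalty):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data where only a few of the 20 features are actually informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

# L1 tends to drive weights of unimportant features exactly to 0; L2 only shrinks them.
print("L1 zero weights:", int(np.sum(l1.coef_ == 0)))
print("L2 zero weights:", int(np.sum(l2.coef_ == 0)))
```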
And with that, we have come to the end of this article. Thanks a ton for reading it.
You can clap if you want. IT’S FREE.
My LinkedIn, Twitter and Github You can check out my website to know more about me and my work.