Logistic Regression is essentially a must-know for any upcoming Data Scientist or Machine Learning Practitioner. It is most likely the first classification model you will encounter. But how does it really work? What does it do? Why is it used for classification? In this article, I hope to answer all of these questions, and by the time you finish reading, you will be able to:
- Explain the Logistic Regression model in simple language
- Understand how Logistic Regression is mathematically formed
- Implement Logistic Regression from scratch using Python
So, get ready for the wild adventure ahead, partner!

The Logistic Regression Model Explained
Logistic Regression is a statistical model that uses a logistic function to predict the probability of an instance belonging to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to the positive class (1). If it does not exceed 50%, the model predicts that it belongs to the negative class (0). Logistic Regression is used in scenarios such as:
- Does a patient have breast cancer or not?
- Will a student be able to pass their test?
- Is a given image appropriate or not?
- Is this credit card transaction fraudulent or legitimate?
All of these are examples of where Logistic Regression can be used. Well, great! Now you know how to define what Logistic Regression is, and explain what it does. But the question is, how does it work, step-by-step? To explain this, we will compare Logistic Regression to Linear Regression!
Understanding Logistic Regression through Linear Regression
In my previous article, I explained Linear Regression: how it works, what it does and how to implement it. In many ways, Logistic Regression is quite similar to Linear Regression. So, let’s briefly summarise what Linear Regression does:
- A linear trendline, or hyperplane, is fit to the data
- The distance between each point and the line (the red dots on the figure being the points, and the green lines being the distances) is calculated, then squared, and the squares are summed (squaring ensures that negative distances do not cancel out positive ones and distort the calculation). This total is the error of the algorithm, with the individual distances better known as residuals.
- We then use an optimisation algorithm (more on that later) to "shift" the trendline so that it fits the data better, based on a cost function
- Steps 2 and 3 are repeated until we reach a desirable output, or until the error is close to 0.

In Logistic Regression, it is quite similar, but there are a few differences:
- Logistic Regression predicts a discrete value (0 or 1), while Linear Regression predicts continuous values (245.6, 89.6, etc.)
- Instead of fitting a straight trendline to the data, it fits an S shaped curve to the data, known as a logistic function. This curve ranges from 0 to 1 and tells you the probability of a class being positive (1) or negative (0).
- The residual calculation used by Linear Regression does not apply to Logistic Regression, as we will discover later when we get into the nitty-gritty details. As a result, it also does not use the same cost function as Linear Regression.
All right, so now you know what Logistic Regression is (in layman’s terms), and you have a high-level overview of what it does and the steps it takes to operate. But I’m certain you still have some questions, such as:
- Why is Logistic Regression called Logistic Regression, not Logistic Classification?
- What is a cost function? Why do they differ when using Linear and Logistic Regression?
- What is an optimisation algorithm?
I assure you, now, all that you seek will be answered!
Clearing up some basic statistics before we continue
Note: right now, these concepts may seem irrelevant, but bear with me, as these concepts form the basis of Logistic Regression and will help give you a much better understanding of the algorithm.

- Odds: the ratio of something occurring to something not occurring. For example, let’s suppose that the L.A. Lakers played 13 matches, winning 5 and losing 8. Then, the odds of the Lakers winning their next match would be the ratio of something happening (the Lakers winning) to something not happening (the Lakers losing). So, in this example, the odds of them winning would be 5/8.
- Probability: the ratio of something occurring to everything that can occur. Going back to our Lakers example, the probability of the Lakers winning their next match is the ratio of something occurring(the Lakers winning) to everything that can occur(Lakers winning and losing). So, the probability would be 5/13.
We have seen one way of calculating the odds, but we can also calculate odds from the probability using this formula:
odds = p / (1 - p)
where p = the probability of something happening. To verify this, let’s see if we can get the odds of the Lakers winning by using this formula:
odds = (5/13) / (1 - 5/13) = (5/13) / (8/13) = 5/8
And sure enough, we get the same result!
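If you would like to check this yourself, here is a tiny Python sketch that plugs the Lakers numbers into both definitions:
# Lakers: 5 wins and 8 losses out of 13 games
wins, losses = 5, 8
p = wins / (wins + losses)      # probability of winning = 5/13
print(wins / losses)            # odds computed directly = 0.625
print(p / (1 - p))              # odds recovered from the probability = 0.625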
The problem with odds
Going back to our example, let’s assume that the Lakers were having a terrible season (clearly not the case), and out of 20 games, they only won 1. So the odds of the Lakers winning would be:
odds = 1 / 19 ≈ 0.05
If they played even worse throughout the season, and won 2 games out of 100, then the odds of them winning would be:
odds = 2 / 98 ≈ 0.02
We can make a simple observation: the worse they play, the closer their odds of winning get to 0. Concretely, when the odds are against them winning, the odds will range between 0 and 1.
Now let’s look at the opposite. If the Lakers were to play 20 games and win 19, then their odds of winning would now be:
odds = 19 / 1 = 19
If they were to play 200 games, and win 194, then their odds of winning would be:
odds = 194 / 6 ≈ 32.33
In other words, when the odds are for the Lakers winning, they begin at 1 and they can go all the way up to infinity.
Clearly, there is a problem here. The odds against the Lakers winning range from 0 to 1, but the odds for them winning range from 1 to infinity. This asymmetry makes it hard to compare the odds for and against the Lakers winning. If only we had a function that made everything symmetrical…
Introducing the Logit Function
logit(p) = log(p / (1 - p)) = log(odds)
Fortunately for us, this exact function exists! It is called the logit function, or the log of the odds. Essentially, it outputs the…log of the odds!
Let’s demonstrate this with an example using, yet again, the Lakers (sorry!). If the Lakers were to play 7 games and win only one, the log of the odds of them winning would be:
log(odds) = log(1/6) ≈ -1.79
And if instead they won 6 games and only lost 1:
log(odds) = log(6/1) ≈ +1.79
Using the log(odds), the distance from the origin is the same for odds of 1 to 6 and 6 to 1. This logit function is vital to understand, as it forms the basis of Logistic Regression. If you are still unsure about what odds and log(odds) are, check out this great video by Statquest.
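To see this symmetry numerically, here is a tiny numpy sketch using the natural log, on the same two scenarios as above:
import numpy as np

# Raw odds of 1-to-6 and 6-to-1 are asymmetric...
print(1 / 6, 6 / 1)                    # 0.1667 and 6.0: very different distances from 1
# ...but their log(odds) are symmetric around 0
print(np.log(1 / 6), np.log(6 / 1))    # ≈ -1.79 and +1.79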
The "Regression" behind Logistic Regression
Ok guys, I am officially going to drop the bomb on you:

Logistic Regression (on its own) is not a classification algorithm. It is actually a regression model (hence the name Logistic Regression)! How? Well, without further ado, let’s take a deep dive into the Logistic Regression model.
Taking a deep dive into Logistic Regression
Logistic Regression is actually part of the family of Generalised Linear Models (GLMs), originally developed by John Nelder and Robert Wedderburn. While Linear Regression’s response values come from the Normal Distribution, Logistic Regression’s response values come from the Binomial Distribution (taking values of 0 and 1).
Logistic Regression is a special type of GLM that generalises linear regression by allowing the linear model to be related to the response variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. (Source).
Basically, Logistic Regression is a type of GLM that models a 0/1 response by using a logit link function, and it uses a special cost function to measure its error.
Right now, you may be mind-boggled as to how this all works, but stick with me and it will make sense soon.
How Logistic Regression works Part 1: The Theory
As we previously discussed, the y-axis values(or target values) of Linear Regression could, in theory, be any number ranging from -infinity to +infinity.
However, in Logistic Regression, the y-axis values only range between 0 and 1.
So, to solve this problem, the y-axis values are transformed from the probability of X happening to… the log(odds) of X happening! Now the values can range from -infinity to +infinity, just like in Linear Regression. We do this transformation using the logit function that we previously discussed.
So, in other words, we transform our S shaped curve into a straight line. This means that while we still use the S shaped curve in Logistic Regression, the coefficients are calculated in terms of the log(odds).
After this, the algorithm is essentially a linear model: we have our straight trendline, and we then fit it to the data using an optimisation algorithm along with a modified cost function for Logistic Regression (more on this later). Once we have made predictions, we "translate", or invert, the predicted log(odds) back into probabilities using the inverse of the link function, better known as the sigmoid function.
Ok, so now you have a slightly better idea of how Logistic Regression works. Let’s try to solidify your knowledge with an example using maths.
How Logistic Regression works Part 2: The Maths
Let’s say we have a model with two features, X1 and X2, and a single binomial response variable Y, whose probability of being 1 we denote p = P(Y=1).
The log of the odds of the event that Y=1 can be expressed as:
log(p / (1 - p)) = β0 + β1X1 + β2X2
We can recover the odds by exponentiating the log-odds:
p / (1 - p) = e^(β0 + β1X1 + β2X2)
By simple algebraic manipulation, the probability that Y=1 is:
p = e^(β0 + β1X1 + β2X2) / (1 + e^(β0 + β1X1 + β2X2)) = 1 / (1 + e^-(β0 + β1X1 + β2X2))
This formula is essentially the inverse of the link function, and it translates our log-odds predictions into probabilities. The function, also known as the sigmoid function, can also be written as:
σ(z) = 1 / (1 + e^-z)
This can be visually seen as an S shaped curve:

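To make the jump from log-odds back to probabilities concrete, here is a minimal numpy sketch of the sigmoid function (we will implement the same function inside our class later):
import numpy as np

def sigmoid(z):
    # squashes any real-valued log-odds into a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

log_odds = np.array([-3.0, 0.0, 3.0])
print(sigmoid(log_odds))   # ≈ [0.047, 0.5, 0.953]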
Here, we can clearly see how to calculate the log of the odds for a given instance, or the probability that Y=1, given the weights. The part that makes Logistic Regression a classification model is the threshold used to split the predicted probabilities. Without it, Logistic Regression is truly a regression model.
A step by step guide to how the algorithm works
- Randomly initialise the parameters (or coefficients) of the model
- Apply the sigmoid function to the predictions of the linear model
- Update the parameters using an optimisation algorithm to get a better fit to the data.
- Calculate the error.
- Repeat 2–4 for n iterations, until a desirable outcome is reached or the calculated error of the model is close to or equal to 0 (a compact sketch of this loop is shown below).
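To make these steps concrete, here is a compact sketch of that loop, assuming a feature matrix X that already includes a column of ones for the intercept, a 0/1 label vector y, and a learning rate alpha; the full, class-based implementation comes later in the article:
import numpy as np

def train_logistic_regression(X, y, alpha=0.1, n_iterations=1000):
    rng = np.random.default_rng(42)
    thetas = rng.random(X.shape[1])                      # step 1: randomly initialise the parameters
    for _ in range(n_iterations):                        # step 5: repeat for n iterations
        h = 1 / (1 + np.exp(-X.dot(thetas)))             # step 2: sigmoid of the linear model's output
        thetas -= alpha * (1 / len(y)) * X.T.dot(h - y)  # step 3: update the parameters
        cost = -(1 / len(y)) * (y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h)))  # step 4: the error
    return thetas, cost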
So now you should have a good understanding of how Logistic Regression works. But I’m sure you may still have questions such as:
- Hey, you never explained what an optimisation algorithm is!
- Buddy, what’s a cost function?
- Wait, it’s a linear model, right? So do we need to feature scale?
Ok, you got me. So, just before we start coding, let me explain these concepts in layman’s terms.
Cost Functions
A cost function is essentially a formula that measures the loss, or the "cost" of your model. If you have ever done any Kaggle competitions, you may have come across some of them. A few common ones include:
- Mean Squared Error
- Root Mean Squared Error
- Mean Absolute Error
- Log Loss
Now, it is worth highlighting an important difference between Linear Regression and Logistic Regression:
- Linear Regression calculates its cost from the residuals (how far each point is from the hypothesis). This can be seen in the image below as the green lines diverging from the points:

Note the green lines coming from each data point. These are the residuals of the model and are used in calculating the cost of the model for Linear Regression.
- When I previously said that the probability is transformed into a log-odds value, I did not mention one scenario: when the probability of X happening is 1. When this is the case, we get the following output:
logit(1) = log(1 / (1 - 1)) = log(1 / 0) → +infinity
Now we can’t calculate the residual, because the distance between the point and the hypothesis could be infinite. That’s why we use a special cost function called the log loss:
J(θ) = -(1/m) Σ [ y·log(h(x)) + (1 - y)·log(1 - h(x)) ]
where m is the number of instances, y is the true label (0 or 1) and h(x) is the predicted probability. This function is derived from maximum likelihood estimation. I’m aware of the length of this article, so if you want more info on it, I recommend checking out this video for more intuition on the function.
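As a quick illustration, here is the same formula in numpy, applied to some made-up predicted probabilities and labels (the values are purely hypothetical):
import numpy as np

# Hypothetical predicted probabilities and true 0/1 labels
h = np.array([0.9, 0.2, 0.8, 0.3])
y = np.array([1, 0, 1, 0])

# Log loss: heavily penalises confident predictions that turn out to be wrong
log_loss = -(1 / len(y)) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
print(log_loss)   # ≈ 0.23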
These functions are essential to model training and development as they answer the fundamental question of "how well is my model predicting new instances?". Keep this in mind, as this ties in with our next topic.
Optimisation Algorithms
Optimisation is usually defined as the process of improving something so that it operates at its full potential. This is also applicable in Machine Learning. In the world of ML, optimisation is essentially trying to find the best combination of parameters for a certain dataset. This is essentially the "learning" bit of Machine Learning.
While many optimisation algorithms exist, I will discuss two of the most common ones: Gradient Descent and the Normal Equation.
Gradient Descent
Gradient Descent is an optimisation algorithm that aims to find the minimum of a function. It achieves this by iteratively taking steps in the negative direction of the slope. In our example, gradient descent continuously updates the weights by moving along the downward slope of the tangent line to the function. Well fantastic, sounds great. English please? 🙂
A concrete example of Gradient Descent

To better illustrate Gradient Descent, let’s go through a simple example. Imagine a person at the top of a mountain who wants to get to the bottom. They might look around and see which direction to take a step in so as to get down quicker, then take a step in that direction, bringing them closer to their goal. However, they have to be careful on the way down, as they might get stuck at a certain point, so they have to choose their step sizes accordingly.
Similarly, the objective of gradient descent is to minimise a function. In our case, it is to minimise the cost of our model. It does this by finding the tangent line to the function and moving in the direction of its downward slope. The size of each "step" the algorithm takes is defined by the learning rate, which essentially controls how far we move down. With this parameter, we have to be careful of two cases:
- If the learning rate is too large, the algorithm might not converge (reach a minimum) and will bounce around the minimum without ever settling at it
- If the learning rate is too small, the algorithm will take too long to reach the minimum, and it might also get "stuck" at a sub-optimal point
We also have a parameter that controls the number of times the algorithm iterates over the dataset.
Visually, the algorithm would do something like this:

Because this algorithm is so essential to Machine Learning, let’s recap what it does:
- Randomly initialises the weights. This is called (you guessed it) random initialisation.
- The model then makes predictions using these random weights.
- The model’s predictions are evaluated through a cost function.
- The model then runs gradient descent by finding the tangent line to the function and taking a step along the downward slope of that tangent
- The process is repeated for N iterations, or until a stopping criterion is met (a minimal standalone example follows below).
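To see gradient descent in isolation, here is a tiny sketch that minimises the made-up one-variable function f(w) = (w - 3)^2, whose minimum we know is at w = 3:
# Gradient descent on f(w) = (w - 3)^2, whose derivative is 2 * (w - 3)
w = 0.0                             # starting point
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (w - 3)          # slope of the function at the current w
    w -= learning_rate * gradient   # step in the negative direction of the slope
print(w)                            # ≈ 3.0, the minimum of the function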
Advantages and Disadvantages of Gradient Descent
Advantages:
- Is very likely to reduce the cost function to its global minimum (for convex cost functions such as log loss)
- One of the most effective optimisation algorithms
Disadvantages:
- Can be slow on large datasets, as it uses the whole dataset to compute the gradient at each step
- Is liable to get stuck at a sub-optimal point (a local minimum) when the cost function is not convex
- The user has to manually choose the learning rate and the number of iterations, which can be time consuming
Now that you have been introduced to Gradient Descent, let’s introduce the Normal Equation.
Normal Equation
θ = (X^T X)^-1 X^T y
If we were to go back to our example, instead of taking steps iteratively down the mountain, we would immediately jump to the bottom. This is the idea behind the Normal Equation. It leverages linear algebra to solve for the weights directly, in a fraction of the time, although for Logistic Regression (which has no closed-form solution) the weights it produces are only an approximation of what Gradient Descent would find.
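To sketch the idea, here is the Normal Equation applied to a tiny, made-up linear-regression problem with numpy (the data is hypothetical, and in practice np.linalg.pinv is a safer choice than a raw matrix inverse):
import numpy as np

# Hypothetical data following y = 1 + 2*x, with an intercept column of ones
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([3, 5, 7, 9], dtype=float)

# Normal equation: thetas = (X^T X)^-1 X^T y
thetas = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(thetas)   # ≈ [1., 2.]  (intercept 1, slope 2)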
Advantages and Disadvantages of the Normal Equation
Advantages:
- No need to choose a learning rate or the number of iterations
- Extremely fast
Disadvantages:
- Does not scale well to large datasets
- Tends to produce good weights, but not optimal ones
Feature Scaling

This is an important preprocessing step for many Machine Learning algorithms, especially those that rely on distance metrics and calculations (like Linear Regression, Gradient Descent and, of course, Logistic Regression, since it really is a regression model!). It essentially scales our features so that they all lie in a similar range. Think of it like a house and a scaled model of the house: the shape of both is the same (they are both houses), but the size is different (5m != 500m). We do this for the following reasons:
- It speeds up algorithms
- Some algorithms are sensitive to scale. In other words, if the features have different scales, there is a chance that a higher weight is given to features with a higher magnitude. This impacts the performance of the machine learning algorithm, and obviously, we do not want our algorithm to be biased towards one feature.
To demonstrate this, let’s suppose we have three data points, named A, B and C, and compare the distance between A and B with the distance between B and C, both before and after scaling. Before scaling, the distances are dominated by whichever feature has the largest magnitude; after scaling, the features are much more comparable and no single feature biases the calculation (the sketch below works through this with assumed values). If you want a great tutorial on feature scaling, please check out this blogpost by Analytics Vidhya.
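Here is a minimal sketch of that comparison using numpy, with made-up values for A, B and C (one feature on a small scale, one on a large scale); the exact numbers are purely illustrative:
import numpy as np

# Hypothetical points: the first feature is small-scale, the second is large-scale
A = np.array([1.0, 20000.0])
B = np.array([2.0, 50000.0])
C = np.array([3.0, 51000.0])

# Before scaling, the distances are dominated by the large-scale feature
print(np.linalg.norm(A - B), np.linalg.norm(B - C))   # ~30000.0 and ~1000.0

# Standardise each feature across the three points, then recompute the distances
data = np.vstack([A, B, C])
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
A_s, B_s, C_s = scaled
print(np.linalg.norm(A_s - B_s), np.linalg.norm(B_s - C_s))   # both features now contribute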
Wow! Your brain must be filled with tons of information! So I suggest taking a break, strolling around, enjoying life and doing some stretches before we get into actually coding the algorithm from scratch!
Coding Logistic Regression from Scratch

Ok, now the moment you have been waiting for; the implementation! Without further ado, let’s begin!
Note: all the code can be downloaded from this Github repo. However, I recommend following along with the tutorial before you do so, because then you will gain a better understanding of what you are actually coding!
First, let’s do some basic imports:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
All we are using is numpy for the mathematical computations, matplotlib to plot graphs, and the breast cancer dataset from scikit-learn.
Next, let’s load our data and define our features and our labels:
# Load the data and define our features and labels
data = load_breast_cancer()
X,y = data['data'],data['target']
Next, let’s create a custom train_test_split function to split our data into a training and test set:
# Custom train test split
def train_test_divide(X,y,test_size=0.3,random_state=42):
    np.random.seed(random_state)
    train_size = 1 - test_size
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand,(100*train_size))
    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]
    return X_train, X_test, y_train, y_test
X_train,X_test,y_train,y_test = train_test_divide(X,y,test_size=0.3,random_state=42)
Basically, we are just:
- taking in a test size.
- setting a random seed to make sure our results are repeatable.
- Obtaining the train set size based on the test set size
- Picking random samples from our features
- Splitting the randomly selected instances into train and test set
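As a side note, scikit-learn ships a built-in splitter that serves the same purpose (it will not reproduce our exact split, but you could swap it in if you prefer):
from sklearn.model_selection import train_test_split

# The library equivalent of our custom function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)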
In order to write clean, repeatable and efficient code, as well as adhere to software development practices, we will create a Logistic Regression class:
class LogReg:
    def __init__(self,X,y):
        self.X = X
        self.y = y
        self.m = len(y)
        self.bgd = False
Let’s add our sigmoid function (the inverse of the logit link) as a method of the class:
def sigmoid(self,z):
    return 1 / (1 + np.exp(-z))
Now, let’s implement our cost function using the formula described above:
def cost_function(self,X,y):
    h = self.sigmoid(X.dot(self.thetas.T))
    m = len(y)
    # Log loss: note that both terms take the log of the predicted probabilities
    J = (1/m) * (-y.dot(np.log(h)) - (1-y).dot(np.log(1-h)))
    return J
- h = hypothesis(our inference or predictions)
- m = number of instances in the training set
- J = our cost function
We will also add a method to insert an intercept term:
def add_intercept_term(self,X):
    X = np.insert(X,0,np.ones(X.shape[0:1]),axis=1).copy()
    return X
- This basically inserts a column of ones at the beginning of our features (a quick illustration follows below).
- If we did not add this, we would be forcing the hyperplane to pass through the origin, causing it to tilt considerably and hence not fit the data properly.
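Here is what np.insert is doing, using a tiny made-up matrix:
import numpy as np

X_demo = np.array([[5.0, 2.0], [3.0, 4.0]])         # hypothetical 2 x 2 feature matrix
X_with_intercept = np.insert(X_demo, 0, 1, axis=1)  # prepend a column of ones
print(X_with_intercept)
# [[1. 5. 2.]
#  [1. 3. 4.]]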
Next, we will scale our features:
X_scaled = (X - μ) / σ   (computed per feature)
def feature_scale(self,X):
    # Standardise each feature (column) to zero mean and unit variance
    X = (X - X.mean(axis=0)) / (X.std(axis=0))
    return X
We will add a method to randomly initialise the parameters of the model:
def initialise_thetas(self):
    np.random.seed(42)
    self.thetas = np.random.rand(self.X.shape[1])
Now, we will implement the Normal Equation using the formula below:
θ = (X^T X)^-1 X^T y
def normal_equation(self):
    A = np.linalg.inv(np.dot(self.X.T,self.X))
    B = np.dot(self.X.T,self.y)
    thetas = np.dot(A,B)
    return thetas
Essentially, we split the algorithm into 3 parts:
- We compute the inverse of the dot product of X transposed and X
- We compute the dot product of X transposed and our labels
- We take the dot product of the two calculated values
So, that’s the Normal Equation! Not too bad! Now, we will implement Batch Gradient Descent using the following formula:
θ := θ - α * (1/m) * X^T * (h - y)
def batch_gradient_descent(self,alpha,n_iterations):
    self.cost_history = [0] * (n_iterations)
    self.n_iterations = n_iterations
    for i in range(n_iterations):
        h = self.sigmoid(np.dot(self.X,self.thetas.T))
        gradient = alpha * (1/self.m) * (h - self.y).dot(self.X)
        self.thetas = self.thetas - gradient
        self.cost_history[i] = self.cost_function(self.X,self.y)
    return self.thetas
Here, we do the following:
- We take in alpha, or the learning rate, and the number of iterations
- We create a list to store our cost functions history to later plot in a line plot
- We loop through the dataset n_iterations times,
- We obtain the predictions, referred to as h(x), and calculate the gradient (the slope of the tangent line to the cost function)
- We update the weights to move down the gradient, by taking the difference between the predictions and the actual values, multiplying it by each feature, and scaling by the learning rate
- We record the cost at each iteration using our custom log loss function.
- Repeat, and when finished, return our final optimised parameters.
Let’s create a fit function to fit our data:
def fit(self,bgd=False,alpha=0.4,n_iterations=2000):
    self.X = self.feature_scale(self.X)
    if bgd == False:
        self.X = self.add_intercept_term(self.X)
        self.thetas = self.normal_equation()
    else:
        self.bgd = True
        self.X = self.add_intercept_term(self.X)
        self.initialise_thetas()
        self.thetas = self.batch_gradient_descent(alpha,n_iterations)
We will also create a plot_function method to show the magnitude of the cost function over the number of epochs:
def plot_cost_function(self):
    if self.bgd == True:
        plt.plot(range((self.n_iterations)),self.cost_history)
        plt.xlabel('No. of iterations')
        plt.ylabel('Cost Function')
        plt.title('Gradient Descent Cost Function Line Plot')
        plt.show()
    else:
        print('Batch Gradient Descent was not used!')
Finally, we will create a predict function for inference, setting the threshold to 0.5:
def predict(self,X_test):
    self.X_test = X_test.copy()
    # Scale the test features and add the intercept term, mirroring the preprocessing in fit
    # (note: for simplicity, the test set is re-scaled with its own statistics)
    self.X_test = self.feature_scale(self.X_test)
    self.X_test = self.add_intercept_term(self.X_test)
    h = self.sigmoid(np.dot(self.X_test,self.thetas.T))
    predictions = (h >= 0.5).astype(int)
    return predictions
The full code for the class looks like the following:
class LogReg:
    def __init__(self,X,y):
        self.X = X
        self.y = y
        self.m = len(y)
        self.bgd = False

    def sigmoid(self,z):
        return 1 / (1 + np.exp(-z))

    def cost_function(self,X,y):
        h = self.sigmoid(X.dot(self.thetas.T))
        m = len(y)
        J = (1/m) * (-y.dot(np.log(h)) - (1-y).dot(np.log(1-h)))
        return J

    def add_intercept_term(self,X):
        X = np.insert(X,0,np.ones(X.shape[0:1]),axis=1).copy()
        return X

    def feature_scale(self,X):
        X = (X - X.mean(axis=0)) / (X.std(axis=0))
        return X

    def initialise_thetas(self):
        np.random.seed(42)
        self.thetas = np.random.rand(self.X.shape[1])

    def normal_equation(self):
        A = np.linalg.inv(np.dot(self.X.T,self.X))
        B = np.dot(self.X.T,self.y)
        thetas = np.dot(A,B)
        return thetas

    def batch_gradient_descent(self,alpha,n_iterations):
        self.cost_history = [0] * (n_iterations)
        self.n_iterations = n_iterations
        for i in range(n_iterations):
            h = self.sigmoid(np.dot(self.X,self.thetas.T))
            gradient = alpha * (1/self.m) * (h - self.y).dot(self.X)
            self.thetas = self.thetas - gradient
            self.cost_history[i] = self.cost_function(self.X,self.y)
        return self.thetas

    def fit(self,bgd=False,alpha=0.4,n_iterations=2000):
        self.X = self.feature_scale(self.X)
        if bgd == False:
            self.X = self.add_intercept_term(self.X)
            self.thetas = self.normal_equation()
        else:
            self.bgd = True
            self.X = self.add_intercept_term(self.X)
            self.initialise_thetas()
            self.thetas = self.batch_gradient_descent(alpha,n_iterations)

    def plot_cost_function(self):
        if self.bgd == True:
            plt.plot(range((self.n_iterations)),self.cost_history)
            plt.xlabel('No. of iterations')
            plt.ylabel('Cost Function')
            plt.title('Gradient Descent Cost Function Line Plot')
            plt.show()
        else:
            print('Batch Gradient Descent was not used!')

    def predict(self,X_test):
        self.X_test = X_test.copy()
        self.X_test = self.feature_scale(self.X_test)
        self.X_test = self.add_intercept_term(self.X_test)
        h = self.sigmoid(np.dot(self.X_test,self.thetas.T))
        predictions = (h >= 0.5).astype(int)
        return predictions
Now, let’s call our newly made class using the Normal Equation:
log_reg_norm = LogReg(X_train,y_train)
log_reg_norm.fit(bgd=False)
And evaluate:
accuracy = round((log_reg_norm.predict(X_test) == y_test).mean(),2)
accuracy
OUT:
0.82
And now, with Gradient Descent, we can also plot the cost function:
log_reg_bgd = LogReg(X_train,y_train)
log_reg_bgd.fit(bgd=True)
log_reg_bgd.plot_cost_function()

And now let’s evaluate it:
accuracy = round((log_reg_bgd.predict(X_test) == y_test).mean(),2)
accuracy
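If you would like an optional sanity check (not part of the from-scratch exercise), you could compare your results against scikit-learn’s own LogisticRegression, which should land in a similar accuracy range:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scale with statistics learned on the training set, then fit sklearn's model
scaler = StandardScaler().fit(X_train)
sk_model = LogisticRegression(max_iter=1000)
sk_model.fit(scaler.transform(X_train), y_train)
print(round(sk_model.score(scaler.transform(X_test), y_test), 2))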
So, now you have successfully implemented Logistic Regression from scratch!
Some homework for you to do:
- We can see that Batch Gradient Descent is the clear winner here in terms of optimisation. However, you could try running it again with fewer iterations, since it seems to have converged early. What is the difference?
- Try skipping feature scaling. Does this affect your results?
- Try leaving out the intercept term. How much of an impact does this have?
I have really enjoyed blogging these past few months, and I am grateful to all my followers who always show their appreciation for my work. Therefore, I would like to thank you for taking time in your day to read this article, and I hope to continue to produce more interesting articles to share with the world. Stay tuned, and have a good time!
