This post follows the logistic regression post in the "Basics and Beyond" series. If you are new to machine learning, I would recommend going through that post first, but if you already know what logistic regression is then let's get to work!
In this post, we will be coding a logistic regression model from the very basics using Python. So let's get started!
Data
At the core of any machine learning algorithm is data. Your machine learning models will always need data to "learn" from. For our purposes today we will be using a very simple student admission results dataset, which can be found here.
This dataset contains the historical records of applicants. A record consists of an applicant's marks in two entrance exams and the final admission decision (whether the candidate was admitted or not). Our goal today is to build a logistic regression model that will be able to decide (or, better put, classify) whether an applicant should be granted admission. In our dataset the first two columns are the marks in the two tests and the third column is the decision label (y), encoded in binary (i.e. y = 1 if admitted and y = 0 if not admitted). Our aim is to predict this label y.
Alright, now that we know what our data looks like, let's define a function to load the data.
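A minimal sketch of such a loader is below, assuming the records are stored as comma-separated values (the filename passed in is a placeholder for wherever you saved the dataset):

```python
import numpy as np

def load_data(filename):
    # Each row: score in test 1, score in test 2, admission decision (1 or 0)
    data = np.loadtxt(filename, delimiter=',')
    plot_data(data[:, :2], data[:, -1])   # visualize the data before training
    return data[:, :2], data[:, -1]       # x (the two scores) and y (the label)
```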
We will be calling the above function later to load the dataset. This function returns x and y: x is made up of the first two columns of the dataset, whereas y is the last column, the result column, which is why the function returns `data[:, :2]` and `data[:, -1]` respectively.
You might have noticed another function call within the `load_data` function: `plot_data(data[:, :2], data[:, -1])`. If you have followed the "Coding Linear Regression from Scratch" post you probably already know what this function is doing and you could just skip to the juicy coding part of the post, but if you haven't, then stick around because we are about to take a deeper dive into the distribution of our data.
Plotting the data
Before we jump to coding our model, let's take a second to analyze our data. This will allow us to understand why (if at all) logistic regression is the way to go for our given dataset and the associated problem.
In order to visualize the data, let's define the `plot_data` function we called from `load_data`.
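One way to write it, as a sketch using matplotlib (the marker styles and axis labels are just illustrative choices):

```python
import matplotlib.pyplot as plt

def plot_data(x, y):
    # Split the examples by their label so the two classes get different markers
    admitted = y == 1
    rejected = y == 0
    plt.scatter(x[admitted, 0], x[admitted, 1], marker='+', label='Admitted')
    plt.scatter(x[rejected, 0], x[rejected, 1], marker='o', label='Not admitted')
    plt.xlabel('Exam 1 score')
    plt.ylabel('Exam 2 score')
    plt.legend()
    plt.show()
```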
This function on being called generates the following plot:

From just a quick glance at the above plot it becomes quite clear that our data does have a decision boundary, which in this case appears to be roughly a straight line. It is important to note that our decision boundary is this simple only because our dataset happens to be distributed such that the boundary is approximately linear. Logistic regression can model far more complex decision boundaries than our current problem requires, including elliptical and other non-linear boundaries.
Phew, that was a lot to take in. Well, now that we have our data ready to go, let's start coding the actual logistic regression model!
Hypothesis
The first step to defining the architecture of your model is to define the hypothesis. We know the hypothesis for logistic regression is in fact the sigmoid function, so mathematically speaking our hypothesis is:
hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
The sigmoid function looks something like this:

Now let's define a simple Python function that mirrors the mathematical equation:
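A minimal version might look like this:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real-valued input into the range (0, 1)
    return 1 / (1 + np.exp(-z))
```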
We have named this function `sigmoid` instead of `hypothesis` or `h`; this will become more intuitive when you see the function being called in statements such as `h = sigmoid(x @ theta)` in later segments of the program.
Alright, now that we have our hypothesis, let's move on to the next step, i.e. the cost function.
Cost Function
To evaluate the quality of our model's output, which in our case is going to be y = 1 (admit) or y = 0 (reject), we make use of the cost function.
If you feel a bit lost you can always follow the "Basics and Beyond: Logistic Regression" post alongside this one, because this post is the exact code version of that one.
Alright, for our purposes here let's jump right to the equation of the cost function for logistic regression:
J(θ) = −(1/m) · Σ [ y(i) · log(hθ(x(i))) + (1 − y(i)) · log(1 − hθ(x(i))) ]
and the code for the same is:
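A vectorized sketch of that equation, assuming `x` already has a column of ones prepended (we do this later) and `y` is a column vector:

```python
def cost_function(x, y, theta):
    m = x.shape[0]                 # number of training examples
    h = sigmoid(x @ theta)         # predicted probabilities for every example
    # Vectorized logistic regression cost
    return -(1 / m) * (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h))
```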
It might seem odd that we are using matrix multiplication instead of an explicit summation over each training example individually; matrix multiplication actually achieves the same result. We will get back to this later in the post.
Now that we have the cost function, what next? Well, our goal is always to minimize the cost of our model, so it follows that if we have the cost function we should now find a way to minimize it. There are many ways (some even more efficient than the one we are about to use) to minimize the cost function, but since our aim here is to code everything from absolute scratch, we will code the gradient descent algorithm to minimize the cost function rather than using an "out-of-the-box" optimizer. Well then, let's move straight to gradient descent.
Gradient Descent
Gradient descent in our context is an optimization algorithm that aims to adjust the parameters in order to minimize the cost function.
The main update step for gradient descent is:
θj := θj − α · (1/m) · Σ (hθ(x(i)) − y(i)) · xj(i)   (updated simultaneously for every parameter θj)
Here we can see that this algorithm is actually the same as that of linear regression. The only difference, however, is the definition of the hypothesis function h(x), which is the sigmoid function in the case of logistic regression. So we multiply the derivative of the cost function by the learning rate (α) and subtract it from the present value of the parameters (θ) to get the new updated parameters (θ). Let's say this again but in Python:
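A sketch of the update loop; it records the cost after every epoch so we can inspect training later:

```python
def gradient_descent(x, y, theta, learning_rate, num_epochs):
    m = x.shape[0]
    J_all = []                                    # cost after each epoch
    for _ in range(num_epochs):
        h = sigmoid(x @ theta)                    # current predictions
        gradient = (1 / m) * (x.T @ (h - y))      # derivative of the cost w.r.t. theta
        theta = theta - learning_rate * gradient  # the update step shown above
        J_all.append(cost_function(x, y, theta))
    return theta, J_all
```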
The `gradient_descent` function returns `theta` and `J_all`. `theta` is obviously our parameter vector, which contains the values of the θs for the hypothesis, and `J_all` is a list containing the cost after each epoch. The `J_all` variable isn't exactly essential, but it helps to analyze the model better, as you will see later in the post.
Putting it all together
Now that we have all our functions defined, all we need to do is call them in the correct order.
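A sketch of the driver code (the filename, learning rate, and epoch count below are placeholder values you may need to adjust; y is also reshaped into a column vector so the vectorized functions above work as written):

```python
x, y = load_data("dataset.txt")   # placeholder filename

# Add a column of ones so that theta_0 acts as the intercept term
x = np.hstack((np.ones((x.shape[0], 1)), x))
y = y.reshape(-1, 1)

theta = np.zeros((x.shape[1], 1))  # initialize the parameters to zero
learning_rate = 0.001              # assumed value; tune as needed
num_epochs = 400                   # assumed value; tune as needed

theta, J_all = gradient_descent(x, y, theta, learning_rate, num_epochs)
print("Final cost:", cost_function(x, y, theta))
print("Parameters:", theta)
```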
We first call the `load_data` function to load the x and y values. x contains the training examples and y contains the labels (the admission result in our case).
You might have noticed that throughout the code we have been using matrix multiplication to compute the expressions we want. For example, in order to get the hypothesis we have to multiply each parameter (θ) with each feature vector (x) and pass that to the sigmoid function. We could use for loops for this, looping over each example and performing the multiplication each time, but this would not be the most efficient method if we had, say, 10 million training examples. A more efficient approach is matrix multiplication. If you aren't very familiar with matrix multiplication, I would suggest you go over it once; it's fairly simple. For our dataset we have two features (i.e. the scores of the applicant in the two tests), so we will have 2 + 1 = 3 parameters. The extra θ0 is the intercept term that allows our decision boundary (in our case an approximate straight line) to sit where it needs to.

Okay, so we have 3 parameters and 2 features. This means our θ or parameter vector (a 1-D matrix) will have the dimensions (3,1), but our feature matrix will have the dimensions (99,2) (according to our dataset). You have probably noticed by now that it is not mathematically possible to multiply these two matrices. However, if we add an extra column of ones to the beginning of our feature matrix, we get two matrices of the dimensions (99,3) and (3,1), which are dimensionally compatible.
In the above code the line `x = np.hstack((np.ones((x.shape[0], 1)), x))` adds an extra column of ones to the beginning of x in order to allow matrix multiplication as required.
After this we initialize our theta vector with zeros. You could also initialize it with some small random values. We also specify the learning rate and the number of epochs (an epoch is one full pass of the algorithm through the entire dataset) we want to train for.
Once we have all our hyper-parameters defined, we call the gradient descent function, which returns a history of all the cost values and the final parameter vector `theta`. This `theta` vector is essentially what defines our final hypothesis. You may observe that the `theta` vector returned by the gradient descent function has the dimensions (3,1). Remember that each individual training example in our dataset also has 3 fields (the 2 original features and the 1 extra column of ones added later). Thus each element of the final `theta` vector, i.e. `theta[0]`, `theta[1]` and `theta[2]`, is in fact θ0, θ1 and θ2 respectively, and corresponds to the parameter for each field of our training examples. So now our final hypothesis looks like so:
hθ(x) = 1 / (1 + e^(−(θ0 + θ1·x1 + θ2·x2)))
The `J_all` variable is nothing but the history of all the cost values. You can print the `J_all` array to see how the cost progressively decreases over the epochs of gradient descent.

This graph can be plotted by defining and calling the `plot_cost` function like so:
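A possible implementation and call (it simply plots the recorded costs against the epoch number):

```python
def plot_cost(J_all, num_epochs):
    # Plot the cost recorded after each epoch of gradient descent
    plt.plot(range(num_epochs), np.array(J_all).ravel())
    plt.xlabel('Epochs')
    plt.ylabel('Cost')
    plt.show()

plot_cost(J_all, num_epochs)
```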
Now we can simply plug in our parameters and obtain predictions from our model. But wait! Our model isn't predicting y = 0 or y = 1; in fact, the output of our model is a floating point number. Well, that's actually what should be happening. This floating point output is the probability of an example belonging to the class y = 1. So we now need to add one final function to our model: the prediction part.
Predicting
Let's define a simple function to interpret the probability output of our model and predict y = 1 (admit) or y = 0 (reject).
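A minimal sketch of such a function:

```python
def predict(x, theta):
    # Threshold the predicted probability at 0.5
    probability = sigmoid(x @ theta)
    return (probability >= 0.5).astype(int)
```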
The above code classifies an example as belonging to the y = 1 (admit) class (or group) if the probability (the output of our model) is greater than or equal to 0.5; otherwise it outputs y = 0 (reject).
Test
You may now test your code by calling a test function that takes as input the scores of an applicant in the two tests and the final `theta` vector returned by our logistic regression model, and tells us whether the applicant should be admitted or not.
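A possible helper along those lines (the function name, its signature, and the example scores are purely illustrative):

```python
def test(score1, score2, theta):
    # Prepend the bias term (1) to the applicant's two test scores
    x_test = np.array([[1, score1, score2]])
    if predict(x_test, theta)[0, 0] == 1:
        print("Admitted")
    else:
        print("Not admitted")

test(45.0, 85.0, theta)   # made-up scores for illustration
```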
The Complete Code
That’s it!
Well, that's pretty much it. You can now code logistic regression from absolute scratch. Being able to understand the complete algorithm and code it is not an easy task, so well done! If you feel comfortable with the example we covered in this post, feel free to pick up another dataset (for logistic regression) and try your hand at it.
Happy Coding! 🙂
References
For your reference you can find the entire code and dataset here: