
Logistic Regression: A Step-by-Step Implementation

From Theory to Practice

Say we are doing a classic prediction task, where we are given an input vector with $n$ variables:

$$x = (x_1, x_2, \dots, x_n)^T$$

We want to predict one response variable $y$ (maybe next year's sales, a house price, etc.). The simplest approach is to use linear regression, with the formula:

$$\hat{y} = W^T x + b$$

Where $W$ is a column vector with $n$ dimensions and $b$ is a bias term. Now say our question changes a bit: we want to predict a probability, like the probability of rain tomorrow. In this setting, linear regression is a poor fit, as a linear expression is unbounded while a probability must lie in $[0, 1]$.

Sigmoid Function

To bound our prediction in $[0, 1]$, the widely used technique is to apply a sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

With numpy we can easily visualize the function.
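A minimal sketch of such a plot (assuming matplotlib for the visualization, in addition to numpy):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # Maps any real value into (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.title("Sigmoid Function")
plt.show()
```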

Loss Function

The loss function of logistic regression is defined as:

$$L(\hat{y}, y) = -\big( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \big)$$

Where $\hat{y}$ is our prediction, ranging over $[0, 1]$, and $y$ is the true value. When the actual value is $y = 1$, the equation becomes:

$$L(\hat{y}, 1) = -\log \hat{y}$$

The closer $\hat{y}$ is to 1, the smaller our loss. The same goes for $y = 0$, where the loss reduces to $-\log(1 - \hat{y})$.
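To make that intuition concrete, a quick numerical check (the `single_loss` helper name is just for illustration):

```python
import numpy as np

def single_loss(y_hat, y):
    # Cross-entropy loss for a single sample
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(single_loss(0.9, 1))  # ~0.105: prediction close to 1, small loss
print(single_loss(0.1, 1))  # ~2.303: prediction far from 1, large loss
```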

Gradient Descent

Given the actual value $y$, we want to minimize the loss $L$, and the technique we apply here is gradient descent (the details are illustrated here). Basically, we take the derivative of the loss with respect to our parameters and move them slightly downhill toward the optimum.

Here we have two parameters, $W$ and $b$, and for this example their update formulas would be:

$$W := W - \alpha \frac{\partial L}{\partial W}, \qquad b := b - \alpha \frac{\partial L}{\partial b}$$

Where $\alpha$ is the learning rate and $W$ is a column vector with $n$ weights corresponding to the $n$ dimensions of $x^{(i)}$. To get the derivatives of our targets, the chain rule is applied:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W} = (\hat{y} - y)\,x, \qquad \frac{\partial L}{\partial b} = \hat{y} - y$$

where $z = W^T x + b$.

You can try the derivation on your own; the only tricky part is the derivative of the sigmoid function. For a good explanation, you can refer to here.
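For reference, that step works out as follows:

$$\frac{d}{dz}\,\sigma(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\big(1 - \sigma(z)\big)$$

Plugging this into the chain rule is what makes the gradients collapse to the neat $(\hat{y} - y)$ form above.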

Batch Training

The above gives the forward and backward updating process, which is enough to implement a logistic regression if we feed training samples into our model ONE AT A TIME. However, in most training setups we don't do that. Instead, training samples are fed in batches, and the backward propagation uses the average loss of the batch.

This means that for a model fed with $m$ samples at a time, the loss function would be:

$$J = \frac{1}{m} \sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big)$$

Where $i$ denotes the $i$-th training sample.

Forward Propagation of Batch Training

Now instead of using $x$, a single vector, as our input, we use a matrix $X$ of size $n \times m$, where, as above, $n$ is the number of features and $m$ is the number of training samples (basically, we line up $m$ training samples as columns of a matrix). The formula becomes:

$$\hat{Y} = \sigma(W^T X + b)$$

Note that here we use UPPERCASE letters to denote our matrices and vectors (a caveat is that $b$ here is still a single value; the more formal way would be to represent $b$ as a vector, but in Python the addition of a single value to a matrix is automatically broadcast).

Let's break down the sizes of the matrices one by one:

  1. $X$ has size $n \times m$: each column is one training sample
  2. $W$ has size $n \times 1$, so $W^T X$ has size $1 \times m$
  3. $b$ is a scalar, broadcast across all $m$ columns
  4. $\hat{Y}$ has size $1 \times m$: one predicted probability per sample

Generating a Classification Task

Our formula work ends here; let's implement the algorithm. Before that, some data needs to be generated to set up a classification task (the whole implementation is also in my git repo).
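The repo has the exact code; a minimal sketch of one way to generate such data, using scikit-learn's `make_classification` (the parameter choices below are my own, not necessarily those of the repo):

```python
import numpy as np
from sklearn.datasets import make_classification

# Two informative features, two classes
X, Y = make_classification(
    n_samples=200, n_features=2, n_informative=2,
    n_redundant=0, n_clusters_per_class=1, random_state=42
)
print(X.shape, Y.shape)  # (200, 2) (200,)
```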

Implementation

Now that everything is set, let's go for the implementation.

Helper Functions

  1. a sigmoid function that takes in an array
  2. a weight function that initializes $W$ and $b$ to zeros
  3. an accuracy function to measure the accuracy of our binary predictions
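A minimal sketch of these helpers (the names `init_weights` and `accuracy` are my own choices):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid; works on scalars and numpy arrays
    return 1 / (1 + np.exp(-z))

def init_weights(n):
    # W as an (n, 1) column vector of zeros, b as a scalar zero
    return np.zeros((n, 1)), 0.0

def accuracy(y_hat, y):
    # Threshold probabilities at 0.5 and compare with the true labels
    return np.mean((y_hat > 0.5).astype(int) == y)
```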

Prediction

Our predict function simply runs the forward pass with the trained weights.
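A sketch, reusing the helpers above:

```python
def predict(X, W, b):
    # X has shape (n, m); returns a (1, m) array of probabilities
    return sigmoid(np.dot(W.T, X) + b)
```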

Training

Notice that for the train function, the input X needs to have shape $n \times m$ and Y shape $1 \times m$, where $m$ is the batch size.

The inputs need to be transposed in order to fit these requirements.
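A sketch of such a train function, under the assumptions above (the learning rate and iteration count are illustrative defaults, not the article's):

```python
def train(X, Y, lr=0.1, n_iters=1000):
    # X: (n, m) features, Y: (1, m) labels
    n, m = X.shape
    W, b = init_weights(n)
    for _ in range(n_iters):
        # Forward pass: (1, m) predicted probabilities
        Y_hat = sigmoid(np.dot(W.T, X) + b)
        # Backward pass: gradients averaged over the batch
        dZ = Y_hat - Y                 # (1, m)
        dW = np.dot(X, dZ.T) / m       # (n, 1)
        db = np.sum(dZ) / m            # scalar
        # Gradient descent update
        W -= lr * dW
        b -= lr * db
    return W, b
```

With the data from `make_classification` above, the transpose looks like:

```python
W, b = train(X.T, Y.reshape(1, -1))
print(accuracy(predict(X.T, W, b), Y.reshape(1, -1)))
```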

Assembling into a Class

Now let's assemble everything into a class so it looks more structured. For completeness, mini-batch training is also implemented.
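A sketch of what such a class might look like, with a simple mini-batch loop (the class name and `batch_size` default are my own choices):

```python
import numpy as np

class LogisticRegression:
    def __init__(self, lr=0.1, n_iters=1000, batch_size=32):
        self.lr = lr
        self.n_iters = n_iters
        self.batch_size = batch_size

    def fit(self, X, Y):
        # X: (n, m) features, Y: (1, m) labels
        n, m = X.shape
        self.W, self.b = np.zeros((n, 1)), 0.0
        for _ in range(self.n_iters):
            for start in range(0, m, self.batch_size):
                # Slice out one mini-batch of columns
                Xb = X[:, start:start + self.batch_size]
                Yb = Y[:, start:start + self.batch_size]
                Y_hat = self._sigmoid(np.dot(self.W.T, Xb) + self.b)
                # Average the gradients over the mini-batch
                dZ = Y_hat - Yb
                self.W -= self.lr * np.dot(Xb, dZ.T) / Xb.shape[1]
                self.b -= self.lr * np.sum(dZ) / Xb.shape[1]
        return self

    def predict(self, X):
        return self._sigmoid(np.dot(self.W.T, X) + self.b)

    @staticmethod
    def _sigmoid(z):
        return 1 / (1 + np.exp(-z))
```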

The full training details can be checked here.

