Implement Logistic Regression with L2 Regularization from scratch in Python

A step-by-step guide to building your own Logistic Regression classifier.

Tulrose Deori
Towards Data Science



Table of contents:

  1. Introduction
  2. Pre-requisites
  3. Mathematics behind the scenes
  4. Regularization
  5. Code
  6. Results and Demo
  7. Future Works and Conclusions
  8. References

1. Introduction:

Logistic Regression is one of the most common machine learning algorithms used for classification. It is a statistical model that uses a logistic function to model a binary dependent variable. In essence, it predicts the probability that an observation belongs to a certain class or label. For instance, is this a cat photo or a dog photo?

NB: Although Logistic Regression can be extended to multi-class classification, we will discuss only binary classification settings in this article.

2. Pre-requisites:

The reader is expected to have an understanding of the following:

  • What is a dataset?
  • What is a feature?
  • What is multicollinearity?
  • What is a sigmoid function?

3. Mathematics behind the scenes

Assumptions: Logistic Regression makes certain key assumptions before starting its modeling process:

  1. The labels are almost linearly separable.
  2. The observations have to be independent of each other.
  3. There is minimal or no multicollinearity among the independent variables.
  4. The independent variables are linearly related to the log odds.

Hypothesis: We want our model to predict the probability of an observation belonging to a certain class or label. As such, we want a hypothesis h that satisfies the following condition: 0 <= h(x) <= 1, where x is an observation.

We define h(x) = g(wᵀx), where g is the sigmoid function and w are the trainable parameters. As such, we have:

h(x) = g(wᵀx) = 1 / (1 + e^(-wᵀx))
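As a quick sketch in NumPy (the helper names here are purely illustrative, not part of the final classifier):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(w, x):
    # h(x) = g(w^T x): predicted probability that x belongs to class 1
    return sigmoid(np.dot(w, x))
```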

The cost for an observation: Now that we can predict the probability for an observation, we want the result to have the minimum error. If the class label is y, the cost (error) associated with an observation x is given by:

Cost(h(x), y) = -log(h(x))        if y = 1
Cost(h(x), y) = -log(1 - h(x))    if y = 0

Cost Function: Thus, the total cost for all the m observations in a dataset is:

J(w) = (1/m) Σ Cost(h(xᵢ), yᵢ), summing over all m observations i = 1, …, m.

We can rewrite the cost function J as:

J(w) = -(1/m) Σ [ yᵢ log(h(xᵢ)) + (1 - yᵢ) log(1 - h(xᵢ)) ]
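A vectorized NumPy sketch of this cost, assuming X is an m × n matrix of observations, y a vector of 0/1 labels, and w the parameter vector:

```python
import numpy as np

def cost(w, X, y):
    # predicted probabilities for all m observations at once
    h = 1.0 / (1.0 + np.exp(-X.dot(w)))
    m = len(y)
    # average cross-entropy over the dataset
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```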

The objective of logistic regression is to find the params w that minimize J. But how do we do that?

Gradient Descent:

Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.

We will update each of the params wᵢ using the following template, where α is the learning rate:

wᵢ := wᵢ - α ∂J(w)/∂wᵢ

For our cost function, the partial derivative works out to ∂J(w)/∂wᵢ = (1/m) Σ (h(x) - y) xᵢ, summed over all m observations.

The above step will help us find a set of params wᵢ, which in turn give us the h(x) we need to solve our binary classification task.
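For reference, a plain (unregularized) gradient descent loop following this template might look like the sketch below; the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iterations=1000):
    m, n = X.shape
    w = np.zeros(n)                            # start with all params at 0
    for _ in range(n_iterations):
        h = 1.0 / (1.0 + np.exp(-X.dot(w)))    # h(x) for every observation
        gradient = (1.0 / m) * X.T.dot(h - y)  # dJ/dw
        w -= learning_rate * gradient          # the update template above
    return w
```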

But there is also an undesirable outcome associated with the above gradient descent steps. In an attempt to find the best h(x), the following things happen:

CASE I: For class label = 0
h(x) will try to produce results as close to 0 as possible
As such, wᵀx will be as small as possible
=> wᵢ will tend to -infinity

CASE II: For class label = 1
h(x) will try to produce results as close to 1 as possible
As such, wᵀx will be as large as possible
=> wᵢ will tend to +infinity

This leads to a problem called overfitting: the model will not be able to generalize well, i.e. it won’t be able to correctly predict the class label for an unseen observation. So, to avoid this, we need to control the growth of the params wᵢ. But how do we do that?

4. Regularization:

Regularization is a technique to address overfitting in a machine learning algorithm by adding a penalty term to the cost function.
There are two types of regularization techniques:

  1. Lasso or L1 Regularization
  2. Ridge or L2 Regularization (we will discuss only this in this article)

So, how can L2 Regularization help to prevent overfitting? Let’s first look at our new cost function:

J(w) = -(1/m) Σ [ yᵢ log(h(xᵢ)) + (1 - yᵢ) log(1 - h(xᵢ)) ] + (λ/2m) Σ wᵢ²

where the first sum runs over the m observations and the second over the params wᵢ.

λ is called the regularization parameter. It controls the trade-off between two goals: fitting the training data well vs keeping the params small to avoid overfitting.

Hence, the gradient of J(w) becomes:

∂J(w)/∂wᵢ = (1/m) Σ (h(x) - y) xᵢ + (λ/m) wᵢ

The regularization term will heavily penalize large wᵢ. The effect will be less on smaller wᵢ’s. As such, the growth of w is controlled. The h(x) we obtain with these controlled params w will be more generalizable.
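In code, this only changes the gradient computation by the extra (λ/m)·wᵢ term. A sketch (regularizing every wᵢ for simplicity, although a bias term, if you keep one, is usually left out of the penalty):

```python
import numpy as np

def regularized_gradient(w, X, y, lam):
    # gradient of the L2-regularized cost: (1/m) X^T (h - y) + (lam/m) w
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return (1.0 / m) * X.T.dot(h - y) + (lam / m) * w
```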

NOTE: λ is a hyperparameter. We have to tune it using cross-validation (a quick sketch of how this could be done follows below).

  • A larger value of λ will shrink wᵢ closer to 0, which might lead to underfitting.
  • λ = 0 will have no regularization effect.

When choosing λ, we have to take proper care of bias vs variance trade-off.
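As an illustration of how such a search could look, the sketch below uses sklearn's own LogisticRegression together with GridSearchCV on a synthetic dataset; note that sklearn parameterizes the penalty strength as C = 1/λ, so a small C means strong regularization. The grid values and dataset are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# search over C = 1/lambda: smaller C means stronger regularization
grid = GridSearchCV(LogisticRegression(penalty="l2", solver="lbfgs"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```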

You can find more about Regularization here.

We are done with all the Mathematics. Let’s implement the code in Python.

5. Code:

NB: Although we defined the regularization param as λ above, the code uses C = 1/λ to stay consistent with the sklearn package.
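Below is a minimal sketch of how such a classifier could be implemented, following the equations above and the C = 1/λ convention. The class name, the default hyperparameters, and the choice to train an unregularized bias term are illustrative assumptions rather than a definitive implementation.

```python
import numpy as np

class LogisticRegressionL2:
    """Binary logistic regression trained with batch gradient descent
    and L2 regularization, where C = 1 / lambda (sklearn convention)."""

    def __init__(self, C=1.0, learning_rate=0.1, n_iterations=5000):
        self.C = C
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    @staticmethod
    def _sigmoid(z):
        # clip z to avoid overflow in exp for very large |z|
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)
        self.b = 0.0                              # bias term, left unregularized
        lam = 1.0 / self.C
        for _ in range(self.n_iterations):
            h = self._sigmoid(X.dot(self.w) + self.b)
            grad_w = (1.0 / m) * X.T.dot(h - y) + (lam / m) * self.w
            grad_b = (1.0 / m) * np.sum(h - y)
            self.w -= self.learning_rate * grad_w
            self.b -= self.learning_rate * grad_b
        return self

    def predict_proba(self, X):
        # probability of belonging to class 1
        return self._sigmoid(X.dot(self.w) + self.b)

    def predict(self, X, threshold=0.5):
        # class label: 1 if probability >= threshold, else 0
        return (self.predict_proba(X) >= threshold).astype(int)
```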

6. Results and Demo:

Let’s fit the classifier on a dummy dataset and observe the results:
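A sketch of such a demo, assuming the LogisticRegressionL2 class sketched above and a synthetic two-feature dataset from sklearn's make_blobs (the dataset parameters are arbitrary illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# two well-separated clusters as a dummy binary dataset
X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=42)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # scale features so gradient descent behaves well

clf = LogisticRegressionL2(C=1.0).fit(X, y)
print("Training accuracy:", (clf.predict(X) == y).mean())

# plot the points and the decision boundary w1*x1 + w2*x2 + b = 0
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", edgecolors="k")
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(clf.w[0] * x1 + clf.b) / clf.w[1]
plt.plot(x1, x2, "k--")
plt.show()
```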

The decision boundary plot:

As we can see, our model is able to classify the observations very well. The line separating the two regions is the decision boundary.

Get a sandbox experience for the model here: LIVE PREVIEW.

7. Future Works and Conclusions:

There is scope to improve the classifier's performance by implementing other optimization algorithms, such as Stochastic Average Gradient (SAG) and Limited-memory BFGS (L-BFGS), to solve the optimization problem.

We can also implement Lasso or L1 regularization.

And that’s all. Thank you for reading my blog. Please leave comments, feedback, and suggestions if you have any.

Reach out to me through my Portfolio or find me on LinkedIn.
