Introduction
In previous articles we have seen the theory and implementation of the Linear Regression algorithm:
- Ace your Machine Learning Interview – Part 1
- Linear Regression and Gradient Descent Using Only Numpy
Now let us see how we can use a variant of Linear Regression for classification problems.
Logistic Regression
What we want to do is to build a model that outputs the probability that a given input belongs to a certain class.
Suppose our input data is represented by a vector x and the parameters of our straight line are represented by the vector θ. We can simply calculate a first score that is commonly called logit in the following way.
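In symbols, writing the bias term explicitly, the logit is just a linear combination of the input features:

$$
t = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
$$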
Once we have this score, we are still unable to say with what probability p our x belongs to a class. How do we get this probability?
Suppose we are in a binary classification problem, so a data item can only belong to class 0 or class 1.
What we do is estimate the probability using the sigmoid function, so our logit will be the input to that function.
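The sigmoid (or logistic) function squashes any real-valued score into the interval (0, 1), which is exactly what we need for a probability:

$$
\hat{p} = \sigma(t) = \frac{1}{1 + e^{-t}}, \qquad t = \theta^T x
$$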
Now let us visualize how the sigmoid function behaves on a plot.
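If you want to reproduce the plot yourself, here is a minimal sketch using numpy and matplotlib (the variable names are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the sigmoid over a range of logit values
t = np.linspace(-10, 10, 200)
sigma = 1 / (1 + np.exp(-t))

plt.plot(t, sigma, label="sigmoid(t)")
plt.axhline(0.5, color="gray", linestyle="--", label="0.5 threshold")
plt.xlabel("t (logit)")
plt.ylabel("estimated probability")
plt.legend()
plt.show()
```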
You can see that the function crosses the value 0.5 exactly at t = 0, splitting the plot in half. We can then say that if the estimated probability is less than 0.5 we will classify our x as class 0, otherwise as class 1. More formally:
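$$
\hat{y} =
\begin{cases}
0 & \text{if } \hat{p} < 0.5 \\
1 & \text{if } \hat{p} \geq 0.5
\end{cases}
$$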
Our goal, though, is to find the best values of θ so that we are able to classify (almost) every instance correctly. So, as in the linear regression case that I invite you to reread here, we again need a cost function that quantifies the error made by the model. The cost function for a single instance is as follows.
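$$
c(\theta) =
\begin{cases}
-\log(\hat{p}) & \text{if } y = 1 \\
-\log(1 - \hat{p}) & \text{if } y = 0
\end{cases}
$$

where y is the true class and p̂ is the probability estimated by the model.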
I invite you to think for a moment about why the cost function above makes sense; it is easier to understand if you visualize the graphs of the logarithms.
- The term -log(x) grows very large as x approaches 0, so the cost will be high if the model estimates a probability close to 0 for a positive instance (an instance of class 1).
- Symmetrically, the cost will be very high if the model estimates a probability close to 1 for a negative instance (an instance of class 0).
The cost displayed above is for a single instance; if we want the cost over the entire dataset, we simply average the costs of all the individual instances.
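This average is often called the log loss:

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log\left(\hat{p}_i\right) + \left(1 - y_i\right) \log\left(1 - \hat{p}_i\right) \right]
$$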
Note that inside the square brackets only one term of the sum (either the left part or the right part) will be active at a time, since if y_i = 1 the left part is active, otherwise the right one is. Spend some time on this formula: even though it may look complicated at first glance, it is actually very simple.
Multinomial Logistic Regression
Logistic Regression can be generalized to multiple classes. When given an input x, the model first computes a score s_k(x) for each class k (similar to the logit), then estimates the probability of each class by running the scores through a Softmax function (also called the Normalized Exponential).
So first let’s compute the score for each class.
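Each class k has its own parameter vector θ_k, so the score is computed exactly like the binary logit:

$$
s_k(x) = \theta_k^T x
$$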
As before, from the scores, we can compute the probability of each class using the softmax function.
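The softmax exponentiates each score and normalizes by the sum of all the exponentials, so that the K class probabilities are positive and sum to 1:

$$
\hat{p}_k = \frac{\exp\left(s_k(x)\right)}{\sum_{j=1}^{K} \exp\left(s_j(x)\right)}
$$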
Given the probabilities for each class, which class do we choose? Well certainly the most likely, so formally argmax over all classes.
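$$
\hat{y} = \underset{k}{\operatorname{argmax}}\; \hat{p}_k = \underset{k}{\operatorname{argmax}}\; s_k(x)
$$

Since the softmax is monotonic in the scores, taking the argmax of the probabilities is equivalent to taking the argmax of the scores directly.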
But which cost function should be used in a multinomial logistic regression? Well, the Cross-Entropy.
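In the standard notation, where y_k^(i) is 1 if the i-th instance belongs to class k and 0 otherwise:

$$
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)
$$

Notice that with K = 2 this reduces exactly to the binary log loss we saw above.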
Let’s code!
Let us now have a look at how to implement a simple logistic regression using sklearn! I would like to point out that we will only touch on the basic functionality of sklearn for this implementation, without dealing with dataset splits, preprocessing, etc. We are going to use the well-known Iris dataset for this script.
First, we import the necessary libraries.
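A minimal sketch of the imports (assuming pandas is used for the quick visualization step):

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
```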
We now import the dataset and visualize it with pandas. The dataset is composed of 4 numerical features, while the target is an integer label indicating the type of flower.
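For example (wrapping the features in a DataFrame is only for inspection and is a choice of mine):

```python
# Load the Iris dataset bundled with sklearn
iris = datasets.load_iris()

# Put the features in a DataFrame to inspect them comfortably
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
print(df.head())
```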
We assign the feature values to a variable x, and the target values to y.
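Something along these lines:

```python
x = iris.data    # the 4 numerical features
y = iris.target  # the flower type, encoded as 0, 1 or 2
```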
Now we can fit the model and use sklearn functions to predict the probabilities of each class, or directly the class itself using the predict() function (which is based on the argmax of those probabilities).
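A sketch of these final steps (max_iter is raised only to avoid a possible convergence warning; the variable names are my own):

```python
# Fit a logistic regression model on the whole dataset
model = LogisticRegression(max_iter=1000)
model.fit(x, y)

# Estimated probability of each of the 3 classes, for the first instances
print(model.predict_proba(x[:3]))

# Predicted class: the argmax of those probabilities
print(model.predict(x[:3]))
```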
Final Thoughts
Logistic regression is perhaps the most important algorithm to know when one starts studying Machine Learning for classification problems. It is a very simple algorithm to implement and performs very well on linearly separable classes. We have seen that it can be used for both binary and multiclass classification.
The End
Marcello Politi