A Tour to Machine Learning and Deep Learning

An Introduction to Logistic Regression

A deep dive into logistic regression, from theory to practice

Yang
Towards Data Science
7 min read · Mar 27, 2019


This blog will cover five topics and questions:

1. What is logistic regression?

2. Why not use linear regression?

3. Maximum likelihood estimation (MLE)

4. Gradient descent for logistic regression

5. Implement logistic regression in Python

1. What is Logistic Regression?

Logistic regression is a traditional and classic statistical model that has been widely used in academia and industry. Unlike linear regression, which is used to predict a numeric response, logistic regression is used to solve classification problems. For example, when a person applies for a loan from a bank, the bank is interested in whether this applicant will default in the future (default or not default).

One solution is to predict the applicant's future status directly, as the Perceptron does; the Perceptron is the foundation of SVMs and neural networks. Please read my blog on the Perceptron:

The other solution, used by logistic regression, is to predict the probability that the applicant will default. Due to the nature of probability, the prediction will fall in [0, 1]. As a rule of thumb, if the predicted probability is greater than or equal to 0.5, we label this applicant as 'default'; if the predicted probability is smaller than 0.5, we label this applicant as 'not default'. However, the range of linear regression is from negative infinity to positive infinity, not [0, 1]. The sigmoid function is introduced to solve this problem. The expression of the sigmoid function is:
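σ(z) = 1 / (1 + e^(-z))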

The sigmoid function gives an S-shaped curve and saturates when its argument is very positive or very negative. Take a moment to note down the formula. We will apply it later in the maximum likelihood estimation.

The sigmoid curve looks like this:
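If you want to reproduce the curve yourself, here is a minimal plotting sketch (it assumes only numpy and matplotlib, the same packages imported in the code later in this post):

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)     # arguments from very negative to very positive
sigma = 1 / (1 + np.exp(-z))      # sigmoid values, always in (0, 1)
plt.plot(z, sigma)
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.show()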

The sigmoid function has several useful properties, including:
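1. Its output always lies in (0, 1), so it can be interpreted as a probability.

2. It is symmetric in the sense that σ(-z) = 1 - σ(z).

3. It is monotonically increasing and differentiable everywhere.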

In logistic regression, we can write:
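P(y = 1 | x) = σ(w^T x + b) = 1 / (1 + e^(-(w^T x + b)))

where w is the weight vector and b is the bias (intercept).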

The derivative of the sigmoid function is shown below; it will be used to calculate the gradient of the cost function.
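σ'(z) = σ(z) · (1 - σ(z))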

2. Why not use Linear Regression?

An Introduction to Statistical Learning gives a straightforward explanation of why logistic regression is used for classification problems instead of linear regression. First of all, the range of linear regression is from negative infinity to positive infinity, which goes outside the boundary of [0, 1]. If both linear regression and logistic regression predict a probability, the linear model can even generate negative predictions, whereas logistic regression does not have this problem. See the figure below.

Figures Source: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, 2017, An Introduction to Statistical Learning

Another problem with linear regression is that its prediction is always cardinal rather than nominal. In some cases the response does take on a natural ordering, such as bad, neutral and good, and it may seem reasonable to code bad, neutral and good as 1, 2 and 3; but that encoding implies the gap between bad and neutral is the same as the gap between neutral and good, which is rarely justified. In general, there is no natural or direct way to convert a nominal response into a numeric response.

3. Maximum Likelihood Estimation (MLE)

Take one sample from the whole population. This record follows a Bernoulli distribution:
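P(Y = y) = p^y · (1 - p)^(1 - y)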

In this formula, y is an indicator that is either 1 or 0, and p is the probability that the event happens.

What if there are N records in total, and what is the joint probability? In brief, assuming the records are independent and identically distributed (i.i.d.), we can multiply the N records' probabilities together:
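L = Π_{i=1}^{N} p_i^(y_i) · (1 - p_i)^(1 - y_i)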

Then take the log of both sides and we get the log likelihood:
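log L = Σ_{i=1}^{N} [ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]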

Note that in the formula, p is the parameter (probability) that needs to be estimated, and for the i-th record it equals:
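p_i = σ(w^T x_i + b) = 1 / (1 + e^(-(w^T x_i + b)))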

In statistics, maximum likelihood estimation (MLE) is widely used to estimate the parameters of a distribution. In this paradigm, maximizing the log likelihood is equivalent to minimizing the cost function J, which is simply the negative log likelihood scaled by 1/N. The cost function J is provided below:
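J(w, b) = -(1/N) · Σ_{i=1}^{N} [ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]

This is exactly the average negative log likelihood, and it is what the Python code below computes as cost.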

This section shows the relation between MLE and the cost function, and how the sigmoid function is embedded in the MLE. The next question is how to calculate p, and further how to calculate w and b, to minimize the cost function.

4. Gradient Descent for Logistic Regression

Unlike linear regression, which has a closed-form solution, logistic regression is fitted with gradient descent. The general idea of gradient descent is to tweak the parameters w and b iteratively to minimize the cost function. There are three typical variants of gradient descent: Batch Gradient Descent, Mini-batch Gradient Descent and Stochastic Gradient Descent. In this blog, Batch Gradient Descent is used.

Figure Source: https://saugatbhattarai.com.np/what-is-gradient-descent-in-machine-learning/

The gradient of the cost function J is:
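∂J/∂w = -(1/N) · Σ_{i=1}^{N} [ y_i / p_i - (1 - y_i) / (1 - p_i) ] · ∂p_i/∂w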

Applying the derivative of the sigmoid function from the first section, we get:
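∂J/∂w = (1/N) · Σ_{i=1}^{N} (p_i - y_i) · x_i

∂J/∂b = (1/N) · Σ_{i=1}^{N} (p_i - y_i)

These match the dw and db computed in the Python code below.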

Initial values are assigned to w and b; then they are iteratively updated by subtracting the learning rate times the gradient of the cost function. The algorithm stops when it converges.
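In symbols, each iteration performs:

w ← w - η · ∂J/∂w

b ← b - η · ∂J/∂b

where η is the learning rate.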

Please see my blog for gradient descent:

5. Implement Logistic Regression in Python

In this part, I will use the well-known iris dataset to show how gradient descent works and how logistic regression handles a classification problem.

First, import the packages:

from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.lines as mlines

Next, load the data. For simplicity, I only use two of the three iris species, and only the first two features (sepal length and sepal width).

# Load data: two iris species, first two features (sepal length and sepal width)
iris = datasets.load_iris()
X = iris.data[0:99, :2]
y = iris.target[0:99]

# Plot the training points
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
plt.figure(2, figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()

Pseudocode for gradient descent

1. Initialize the parameters
Repeat {
    2. Make a prediction on y
    3. Calculate the cost function
    4. Compute the gradient of the cost function
    5. Update the parameters
}

Code for gradient descent

# Step 1: Initialize model parameters
Learning_Rate = 0.01
num_iterations = 100000
N = len(X)
w = np.zeros((2, 1))
b = 0
costs = []

for i in range(num_iterations):
    # Step 2: Apply the sigmoid function to get the predicted probabilities
    Z = np.dot(w.T, X.T) + b
    y_pred = 1 / (1 + np.exp(-Z))
    # Step 3: Calculate the cost function (average negative log likelihood)
    cost = -(1 / N) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    # Step 4: Calculate the gradients
    dw = 1 / N * np.dot(X.T, (y_pred - y).T)
    db = 1 / N * np.sum(y_pred - y)
    # Step 5: Update w and b
    w = w - Learning_Rate * dw
    b = b - Learning_Rate * db
    # Record the cost every 100 iterations
    if i % 100 == 0:
        costs.append(cost)
        print(cost)

Visualize the cost function over time

# Plot the cost function over training
Epoch = pd.DataFrame(list(range(0, num_iterations, 100)))  # iterations at which the cost was recorded
Cost = pd.DataFrame(costs)
Cost_data = pd.concat([Epoch, Cost], axis=1)
Cost_data.columns = ['Epoch', 'Cost']
plt.scatter(Cost_data['Epoch'], Cost_data['Cost'])
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.show()

From the figure above, we can see that the cost decreases dramatically at first; after about 40,000 iterations it becomes stable.

Visualize the linear decision boundary

# Plot linear classification
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,edgecolor='k')
line=mlines.Line2D([3.701,7],[2,4.1034],color='red')
ax.add_line(line)
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
plt.show()

The red line in the figure above is the decision boundary of logistic regression. Because this iris data only contains two features, the decision boundary is a line. When there are three or more features, the decision boundary is a hyperplane.
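The endpoints of the red line above are hard-coded, but the boundary itself comes from the model: it is the set of points where w^T x + b = 0, i.e. where the predicted probability equals 0.5. As a minimal sketch, the line could also be drawn directly from the learned parameters (assuming w, b, X and ax are the objects created in the code above):

# Decision boundary: w1*x1 + w2*x2 + b = 0  =>  x2 = -(w1*x1 + b) / w2
x1_vals = np.array([X[:, 0].min(), X[:, 0].max()])   # sepal length range
x2_vals = -(w[0, 0] * x1_vals + b) / w[1, 0]         # corresponding sepal width on the boundary
ax.plot(x1_vals, x2_vals, color='red')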

Summary

In this blog, I explained logistic regression from theory to practice. I hope you have a better understanding of logistic regression after reading it. If you are interested in my other blogs, please click on the following link:

Reference

[1] Ian Goodfellow, Yoshua Bengio, Aaron Courville, (2017) Deep Learning

[2] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, (2017) An Introduction to Statistical Learning

[3] https://en.wikipedia.org/wiki/Gradient_descent
