
Andrew Ng’s Machine Learning Course in Python (Logistic Regression)

Logistic Regression

Machine Learning – Andrew Ng

Continuing the series, this post covers the Python implementation of Andrew Ng's Machine Learning Course, this time for Logistic Regression.

Logistic regression is used in classification problems, where the labels are a discrete set of classes, as opposed to linear regression, where the labels are continuous variables.


As usual, we start by importing the libraries and the dataset. This dataset contains two exam scores for each student together with their university admission status. We are asked to predict whether a student gets admitted into the university based on these scores.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# the dataset has no header row: two exam scores and an admission label
df = pd.read_csv("ex2data1.txt", header=None)

Making sense of the data

df.head()
df.describe()

Plotting the data

# split into features (the two exam scores) and labels (admission status)
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# boolean masks for the admitted (y==1) and rejected (y==0) examples
pos, neg = (y == 1).reshape(100, 1), (y == 0).reshape(100, 1)
plt.scatter(X[pos[:, 0], 0], X[pos[:, 0], 1], c="r", marker="+")
plt.scatter(X[neg[:, 0], 0], X[neg[:, 0], 1], marker="o", s=10)
plt.xlabel("Exam 1 score")
plt.ylabel("Exam 2 score")
plt.legend(["Admitted", "Not admitted"], loc=0)

As this is not a standard scatter or line plot, I will break the code down step by step for easy understanding. For a classification problem, we plot the independent variables against each other and mark the different classes to observe their relationship. Here, we need to distinguish the combinations of x1 and x2 that led to a university admission from those that did not, and that is exactly what the boolean masks pos and neg do. By plotting the admitted combinations with a different color and marker from the rejected ones, we can visualize the relationship.

As expected, students with higher scores on both exams were admitted into the university.

Now for the sigmoid function, which differentiates logistic regression from linear regression:

def sigmoid(z):
    """
    return the sigmoid of z
    """

    return 1/ (1 + np.exp(-z))
# testing the sigmoid function
sigmoid(0)

Running sigmoid(0) returns 0.5.
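Since it is written with np.exp, the function also works element-wise on arrays, which is what the cost function below relies on. A quick check:

# sigmoid is vectorized: it maps the whole array at once
sigmoid(np.array([-10, 0, 10]))   # ≈ [4.54e-05, 0.5, 0.99995]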

Next, we compute the cost function J(Θ) and its gradient (the partial derivatives of J(Θ) with respect to each Θ).
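For reference, these are the expressions from the lectures, written out in LaTeX, where the hypothesis is h_\Theta(x) = g(\Theta^T x):

J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\Theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - h_\Theta(x^{(i)})\right)\right]

\frac{\partial J(\Theta)}{\partial \Theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\Theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}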

def costFunction(theta, X, y):
    """
    Takes in numpy arrays theta, X and y and returns the logistic
    regression cost and gradient
    """
    m = len(y)

    predictions = sigmoid(np.dot(X, theta))
    # cross-entropy error for each training example
    error = (-y * np.log(predictions)) - ((1 - y) * np.log(1 - predictions))
    cost = 1 / m * sum(error)

    grad = 1 / m * np.dot(X.transpose(), (predictions - y))

    return cost[0], grad

Setting the initial theta and testing the cost function

m, n = X.shape[0], X.shape[1]
# add a column of ones for the intercept term
X = np.append(np.ones((m, 1)), X, axis=1)
y = y.reshape(m, 1)
initial_theta = np.zeros((n + 1, 1))
cost, grad = costFunction(initial_theta, X, y)
print("Cost of initial theta is", cost)
print("Gradient at initial theta (zeros):", grad)

The print statements output:

Cost of initial theta is 0.693147180559946
Gradient at initial theta (zeros): [[-0.1], [-12.00921659], [-11.26284221]]

Now for the optimizing algorithm. In the assignment itself, we were told to use the fminunc function in Octave to find the minimum of an unconstrained function. In Python, the scipy.optimize library serves a similar purpose; you can find its official documentation online. There are various optimization methods to choose from, and many others before me have used them for their Python implementations. Here, I decided to use gradient descent for the optimization and compare the result with fminunc in Octave.
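For completeness, here is a minimal sketch of what the scipy route could look like. It assumes the costFunction and variables defined above; the choice of the TNC method and the reshaping wrapper are mine, not part of the assignment:

from scipy.optimize import minimize

# scipy passes theta as a flat 1-D array, while our costFunction
# expects a column vector, so we convert back and forth
def cost_wrapper(theta_flat, X, y):
    cost, grad = costFunction(theta_flat.reshape(-1, 1), X, y)
    return cost, grad.flatten()

res = minimize(cost_wrapper, initial_theta.flatten(), args=(X, y),
               method="TNC", jac=True)
print("Optimized theta:", res.x)
print("Optimized cost:", res.fun)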

Before running gradient descent, never forget to apply feature scaling to a multivariate problem.

def featureNormalization(X):
    """
    Take in numpy array of X values and return the normalized X values,
    the mean and standard deviation of each feature
    """
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)

    X_norm = (X - mean) / std

    return X_norm, mean, std
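Applying it to the two score columns (skipping the intercept column of ones we appended earlier) gives us the normalized features, plus the mean and standard deviation that we will need later to scale new examples. A short sketch, with X_norm, X_mean and X_std as the names I use below:

# normalize the score columns, then re-attach the intercept column
X_norm, X_mean, X_std = featureNormalization(X[:, 1:])
X_norm = np.append(np.ones((m, 1)), X_norm, axis=1)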

As mentioned in the lecture, the gradient descent algorithm is very similar to that of linear regression. The only difference is that the hypothesis h(x) is now g(Θ^Tx), where g is the sigmoid function.
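Concretely, every step updates each parameter simultaneously:

\Theta_j := \Theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(g(\Theta^T x^{(i)}) - y^{(i)}\right)x_j^{(i)}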

def gradientDescent(X, y, theta, alpha, num_iters):
    """
    Take in numpy arrays X, y and theta and update theta by taking
    num_iters gradient steps with learning rate alpha

    return theta and the list of the cost of theta during each iteration
    """

    m=len(y)
    J_history =[]

    for i in range(num_iters):
        cost, grad = costFunction(theta,X,y)
        theta = theta - (alpha * grad)
        J_history.append(cost)

    return theta , J_history

I always like the DRY ("Don't Repeat Yourself") principle in coding. Since we already have a function that computes the gradient, let's not repeat the calculation and simply add an alpha term here to update Θ.

As the assignment did not implement gradient descent, I had to test a few alpha and num_iters values to find the optimal ones.
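Each run below follows the same pattern; a sketch, assuming the normalized X_norm from the previous step:

# run gradient descent and plot the cost over the iterations
alpha, num_iters = 0.01, 400   # first attempt; tuned below
theta, J_history = gradientDescent(X_norm, y, initial_theta, alpha, num_iters)
plt.plot(J_history)
plt.xlabel("Iteration")
plt.ylabel("Cost")
plt.title("Cost function using Gradient Descent")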

Using alpha=0.01 and num_iters=400:

Gradient descent works, reducing the cost function at every iteration, but we can do better. With alpha=0.1 and num_iters=400:

Much better, but I will try another value just to make sure. With alpha=1 and num_iters=400:

The drop is sharper and the cost function plateaus around 150 iterations. Using these alpha and num_iters values, the optimized theta is [[1.65947664], [3.8670477], [3.60347302]] and the resulting cost is 0.20360044248226664, a significant improvement from the initial 0.693147180559946. Compared to the optimized cost of 0.203498 obtained with fminunc in the Octave assignment, it is not far off.

Next is plotting the decision boundary using the optimized theta. There is a step-by-step explanation of how to plot it within the course's resources.
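The boundary is the line where the hypothesis equals exactly 0.5, i.e. where Θ^Tx = 0; solving for x2 gives the line we plot:

\Theta_0 + \Theta_1 x_1 + \Theta_2 x_2 = 0 \quad\Rightarrow\quad x_2 = -\frac{\Theta_0 + \Theta_1 x_1}{\Theta_2}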

# plot in the normalized feature space, since theta was learned on X_norm
plt.scatter(X_norm[pos[:, 0], 1], X_norm[pos[:, 0], 2], c="r", marker="+", label="Admitted")
plt.scatter(X_norm[neg[:, 0], 1], X_norm[neg[:, 0], 2], c="b", marker="x", label="Not admitted")
x_value = np.array([np.min(X_norm[:, 1]), np.max(X_norm[:, 1])])
y_value = -(theta[0] + theta[1] * x_value) / theta[2]
plt.plot(x_value, y_value, "r")
plt.xlabel("Exam 1 score (normalized)")
plt.ylabel("Exam 2 score (normalized)")
plt.legend(loc=0)

Making predictions using the optimized theta

# scale the new example with the training mean and standard deviation
x_test = np.array([45, 85])
x_test = (x_test - X_mean) / X_std
x_test = np.append(np.ones(1), x_test)   # add the intercept term
prob = sigmoid(x_test.dot(theta))
print("For a student with scores 45 and 85, we predict an admission probability of", prob[0])

The print statement outputs: For a student with scores 45 and 85, we predict an admission probability of 0.7677628875792492. This is a close approximation to the 0.776291 obtained using fminunc.

To find the accuracy of the classifier, we compute the percentage of correct classifications on the training set.

def classifierPredict(theta, X):
    """
    Take in numpy arrays theta and X and predict the class
    """
    predictions = X.dot(theta)

    # predict admission when the linear score is positive,
    # i.e. when sigmoid(X.dot(theta)) > 0.5
    return predictions > 0

p = classifierPredict(theta, X_norm)
print("Train Accuracy:", sum(p == y)[0], "%")

The classifierPredict function returns a boolean array that is True when the probability of admission into the university is more than 0.5 and False otherwise. Taking sum(p==y) adds up all instances where the predictions match the y values.

The print statement prints: Train Accuracy: 89 %, indicating that our classifier predicts 89% of the training set correctly.
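Note that the raw count equals a percentage here only because this training set happens to contain exactly 100 examples; a more general version of the same computation would be:

# fraction of correct predictions, scaled to a percentage
accuracy = float(np.mean(p == y) * 100)
print("Train Accuracy:", accuracy, "%")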


This is all for Logistic Regression. As usual, the Jupyter notebook is uploaded to my GitHub at (https://github.com/Benlau93/Machine-Learning-by-Andrew-Ng-in-Python).

For other Python implementations in the series, see my other posts.

Thank you for reading.

