Continuing from programming assignment 2 (Logistic Regression), we now proceed to regularized logistic regression in Python to help us deal with the problem of overfitting.
Regularization is a shrinkage method that shrinks the coefficients towards zero to prevent overfitting by reducing the variance of the model.
Going straight into the assignment, we start by importing all relevant libraries and the dataset. This time, the dataset contains two test results of microchips from a factory, and we are going to use the test results to predict whether a microchip should be accepted or rejected.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("ex2data2.txt", header=None)
df.head()
df.describe()

As you can see, it is a multivariate, binary classification problem that we can solve using logistic regression.
Now to visualize the data. As with the previous logistic regression visualization, each combination of x1 and x2 that leads to an accepted microchip is plotted against the combinations that result in a rejection.
X=df.iloc[:,:-1].values
y=df.iloc[:,-1].values
pos , neg = (y==1).reshape(118,1) , (y==0).reshape(118,1)
plt.scatter(X[pos[:,0],0],X[pos[:,0],1],c="r",marker="+")
plt.scatter(X[neg[:,0],0],X[neg[:,0],1],marker="o",s=10)
plt.xlabel("Test 1")
plt.ylabel("Test 2")
plt.legend(["Accepted","Rejected"],loc=0)

Plotting the data clearly shows that the decision boundary separating the two classes is non-linear. This leads to the next step of feature mapping, where we add additional polynomial terms to try and better fit the data (plain logistic regression is only able to fit a linear decision boundary, which will not do well in this case). The assignment specifies that we add polynomial terms up to the 6th power.
def mapFeature(x1,x2,degree):
    """
    take in numpy array of x1 and x2, return all polynomial terms up to the given degree
    """
    out = np.ones(len(x1)).reshape(len(x1),1)
    for i in range(1,degree+1):
        for j in range(i+1):
            terms = (x1**(i-j) * x2**j).reshape(len(x1),1)
            out = np.hstack((out,terms))
    return out

X = mapFeature(X[:,0], X[:,1], 6)
The mapFeature function also adds a column of ones to X so we do not have to deal with it later on. Here, I decided to use np.hstack instead of np.append to add a new column to the numpy array; I found np.hstack to be much neater in the code compared to np.append, which I normally used. I also allow degree as a parameter instead of fixing it to 6 as in the assignment, so feel free to play around with different degrees and compare the results.
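As a quick sanity check of my own (assuming the 118-row ex2data2.txt dataset and degree 6), the mapped matrix should have one column per polynomial term, including the column of ones:
print(X.shape)  # expected: (118, 28), since 1 + 2 + 3 + ... + 7 = 28 terms for degree 6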

To visualize what we are doing, it helps to write out the polynomial terms involved.
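Since the diagram from the assignment is not reproduced here, this is a sketch of the mapped feature vector for degree 6, following the assignment's convention:

$$\text{mapFeature}(x) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1x_2 & x_2^2 & x_1^3 & \cdots & x_1x_2^5 & x_2^6 \end{bmatrix}^T$$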
Next, we carry on to define a function that computes the regularized cost function and gradient. Remember that the cost function now has an additional shrinkage penalty that is controlled by λ.
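For reference, this is the regularized cost function and gradient that the code below implements, written in the assignment's notation with h_θ(x) the sigmoid of θᵀx; note that θ₀ is not penalized:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\big(h_\theta(x^{(i)})\big) - (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}, \qquad \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad (j \ge 1)$$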
def sigmoid(z):
    """
    return the sigmoid of z
    """
    return 1/ (1 + np.exp(-z))

def costFunctionReg(theta, X, y, Lambda):
    """
    Take in numpy array of theta, X, and y to return the regularized cost function and gradient
    of a logistic regression
    """
    m = len(y)
    y = y[:,np.newaxis]
    predictions = sigmoid(X @ theta)
    error = (-y * np.log(predictions)) - ((1-y)*np.log(1-predictions))
    cost = 1/m * sum(error)
    # theta[0] (the intercept term) is excluded from the penalty, as in the assignment
    regCost = cost + Lambda/(2*m) * sum(theta[1:]**2)

    # compute gradient
    j_0 = 1/m * (X.transpose() @ (predictions - y))[0]
    j_1 = 1/m * (X.transpose() @ (predictions - y))[1:] + (Lambda/m) * theta[1:]
    grad = np.vstack((j_0[:,np.newaxis], j_1))
    return regCost[0], grad
You can use the same costFunction code and add on a λ term to compute the regularized cost function. As another improvement over my last Python implementation, I found that replacing np.dot with @ is the preferred option for matrix multiplication according to the official numpy documentation. Writing @ is also much easier, so I am cool with that. In case you are not familiar with numpy broadcasting, you can check it out here. Broadcasting is the reason I need to add a new axis to y using y=y[:,np.newaxis], to ensure that the linear algebra manipulation works as I expect. If you noticed, I used reshape() in the previous implementation to deal with broadcasting (told you guys I am learning as I go). By the way, np.vstack here adds a new row, whereas np.hstack adds a new column.
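If the broadcasting point is unclear, here is a minimal illustration of my own (toy shapes, not the assignment data) of why the extra axis matters:
a = np.ones((5, 1))                   # column vector, like predictions
b = np.ones(5)                        # 1-D array, like y before adding an axis
print((a * b).shape)                  # (5, 5) -- broadcasting expands both arrays
print((a * b[:, np.newaxis]).shape)   # (5, 1) -- element-wise, as intended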
# Initialize fitting parameters
initial_theta = np.zeros((X.shape[1], 1))
# Set regularization parameter lambda to 1
Lambda = 1
#Compute and display initial cost and gradient for regularized logistic regression
cost, grad=costFunctionReg(initial_theta, X, y, Lambda)
print("Cost at initial theta (zeros):",cost)
The print statement prints: Cost at initial theta (zeros): 0.6931471805599461
As for the optimizing algorithm, I again make use of standard gradient descent instead of fminunc. The Python way of doing fminunc can be found here.
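Since that link may not be handy, here is a minimal sketch of my own (not part of the assignment) of how one could optimize the same regularized cost with scipy.optimize.minimize; the hypothetical costFlat wrapper is only there because scipy works with a flat 1-D theta while costFunctionReg expects a column vector:
from scipy.optimize import minimize

def costFlat(theta_flat, X, y, Lambda):
    # wrapper: reshape the flat theta for costFunctionReg and flatten the gradient back
    cost, grad = costFunctionReg(theta_flat[:, np.newaxis], X, y, Lambda)
    return cost, grad.flatten()

res = minimize(costFlat, np.zeros(X.shape[1]), args=(X, y, Lambda),
               method="TNC", jac=True)
theta_opt = res.x[:, np.newaxis]  # back to a column vector
That said, below I stick with plain gradient descent as in the previous posts.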
def gradientDescent(X,y,theta,alpha,num_iters,Lambda):
    """
    Take in numpy array X, y and theta and update theta by taking num_iters gradient steps
    with learning rate of alpha
    return theta and the list of the cost of theta during each iteration
    """
    m = len(y)
    J_history = []
    for i in range(num_iters):
        cost, grad = costFunctionReg(theta,X,y,Lambda)
        theta = theta - (alpha * grad)
        J_history.append(cost)
    return theta, J_history
theta , J_history = gradientDescent(X,y,initial_theta,1,800,0.2)
print("The regularized theta using ridge regression:n",theta)
The print statement will print:

Again, the alpha, num_iters and λ values were not given in the assignment; try a few combinations of values and come up with the best.
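For example, here is a quick (and admittedly crude) sketch of how one could compare a few λ values by re-fitting and looking at the unregularized training cost; the candidate values below are arbitrary placeholders:
for Lambda_try in [0, 0.2, 1, 10]:
    theta_try, _ = gradientDescent(X, y, np.zeros((X.shape[1], 1)), 1, 800, Lambda_try)
    fit_cost, _ = costFunctionReg(theta_try, X, y, 0)  # cost without the penalty, just to compare the fit
    print("Lambda =", Lambda_try, "-> training cost (no penalty):", fit_cost)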
plt.plot(J_history)
plt.xlabel("Iteration")
plt.ylabel("$J(Theta)$")
plt.title("Cost function using Gradient Descent")
Using the values I stated, this is the resulting plot of the cost function against the number of iterations.

Decreasing cost function – check. Cost function plateau – check.
def mapFeaturePlot(x1,x2,degree):
    """
    take in scalar values of x1 and x2, return all polynomial terms up to the given degree
    """
    out = np.ones(1)
    for i in range(1,degree+1):
        for j in range(i+1):
            terms = (x1**(i-j) * x2**j)
            out = np.hstack((out,terms))
    return out
plt.scatter(X[pos[:,0],1],X[pos[:,0],2],c="r",marker="+",label="Accepted")
plt.scatter(X[neg[:,0],1],X[neg[:,0],2],c="b",marker="x",label="Rejected")
# Plotting decision boundary
u_vals = np.linspace(-1,1.5,50)
v_vals= np.linspace(-1,1.5,50)
z=np.zeros((len(u_vals),len(v_vals)))
for i in range(len(u_vals)):
for j in range(len(v_vals)):
z[i,j] =mapFeaturePlot(u_vals[i],v_vals[j],6) @ theta
plt.contour(u_vals,v_vals,z.T,0)
plt.xlabel("Exam 1 score")
plt.ylabel("Exam 2 score")
plt.legend(loc=0)
Plotting a non-linear decision boundary involves plotting a contour that separates the different classes. I did some googling and this stackoverflow answer might help some of you here. The code is a direct translation of the Octave code given in the assignment: we evaluate θᵀ·mapFeature(u,v) over a grid of (u,v) values and draw the contour where it equals zero, since that is exactly where the model switches from predicting one class to the other. For the maths and intuition behind the code, check out the stackoverflow link given above.

Pretty good I would say!
To check the accuracy of the model, we again make use of the percentage of correct classifications on the training data.
def classifierPredict(theta,X):
    """
    take in numpy array of theta and X and predict the class
    """
    # X @ theta > 0 is equivalent to sigmoid(X @ theta) >= 0.5
    predictions = X.dot(theta)
    return predictions > 0
p=classifierPredict(theta,X)
print("Train Accuracy:", (sum(p==y[:,np.newaxis])/len(y) *100)[0],"%")
The print statement prints: Train Accuracy: 83.05084745762711%. Close, but not as high as the 88.983051% obtained in the assignment using fminunc.
Next, I would like to touch on Lasso regression, another regularization method used to prevent overfitting. With reference to 'Introduction to Statistical Learning', Lasso regression has a clear advantage over ridge regression (the regularization we just did): while ridge regression shrinks coefficients towards zero, it can never reduce them to exactly zero, and hence all features are included in the model no matter how small their coefficients. Lasso regression, on the other hand, is able to shrink coefficients to exactly zero, reducing the number of features and serving as a feature selection tool at the same time. This makes Lasso regression useful in high-dimensional cases and helps with model interpretability.
For Lasso regression, the cost function to be minimized is very similar to that of ridge regression.

Instead of adding the sum of squared Θ values, it adds the sum of their absolute values. Since absolute values are involved, minimizing the cost is harder because the penalty is not differentiable at zero. Various specialized algorithms are available to compute the solution, and one such implementation can be found in the sklearn library.
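Since the equation image is not shown here, this is a sketch of the Lasso form of the cost, with the squared penalty replaced by an absolute-value (L1) penalty; the exact scaling of λ is a convention and differs between libraries:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\big(h_\theta(x^{(i)})\big) - (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\right] + \frac{\lambda}{m}\sum_{j=1}^{n}|\theta_j|$$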
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty="l1", solver="liblinear")  # liblinear supports the l1 penalty
clf.fit(X,y)
thetaLasso=clf.coef_
print("The regularized theta using lasso regression:\n",thetaLasso.reshape(28,1))
A side-by-side comparison of the theta values.

As you can see, lasso regression reduces several coefficients to exactly 0, reducing the dimension of the problem and improving model interpretability.
This is all I have for regularization. The Jupyter notebook will be uploaded to my GitHub at (https://github.com/Benlau93/Machine-Learning-by-Andrew-Ng-in-Python).
For other Python implementations in the series:
- Linear Regression
- Logistic Regression
- Neural Networks
- Support Vector Machines
- Unsupervised Learning
- Anomaly Detection
Thank you for reading.