Linear regression: comparing Python's sklearn with writing it from scratch

Ever wanted to know what happens under the hood of multivariate linear regression and gradient descent in Python's sklearn? You will be surprised how easy it is. Let's write it from scratch and apply the same evaluation method to both approaches to see how we do.

Shaun Enslin
Towards Data Science

--

Sani2C by author — could linear regression make me cycle faster?

Introduction

Understanding what happens inside linear regression gives you a much better feel for the algorithm than simply calling a library, so here is a deep dive without using Python's machine-learning libraries. Here is a link to the source code for this article on GitHub.
If you need an intro to gradient descent, have a look at my 5-part YouTube series first.

In step 1, we will write gradient descent from scratch, while in step 2 we will use sklearn’s linear regression.

Our dataset

Let's download our dataset from Kaggle. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) tracks the health status of all countries, as well as many other related factors. The datasets are made available to the public for health data analysis. The life expectancy and health-factor data for 193 countries was collected from the WHO data repository, and the corresponding economic data was collected from the United Nations website.

Figure 1: Screen shot from kaggle

Step 1: Linear regression/gradient descent from scratch

Let’s start with importing our libraries and having a look at the first few rows.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
df = pd.read_csv('Life Expectancy Data.csv')
df.head()

You will get a nice view of the data and can see that Country and Status are text fields, while Life expectancy is the field we want to predict.

Figure 2: Screen shot from jupyter

To get a little more insight, let's run an info and have a look at the output in figure 3. A few things stand out:

  • We have 2 text fields, i.e. Country and Status
  • Fields such as Alcohol, Hepatitis B etc. have null values which we will need to resolve
  • The column names need some work

df.info()

Figure 3: Screen shot from jupyter

Prepare the data

The names are not great to work with, so let's rename some of the columns. Then convert the object fields to numbers, as we cannot work with text. Finally, let's move the target into its own array `Y` and drop it from `df`. The result of this step is that `df` is our feature set and only contains numbers, while `Y` is our result set.

df.rename(columns = {" BMI " : "BMI",
                     "Life expectancy ": "Life_expectancy",
                     "Adult Mortality": "Adult_mortality",
                     "infant deaths": "Infant_deaths",
                     "percentage expenditure": "Percentage_expenditure",
                     "Hepatitis B": "HepatitisB",
                     "Measles ": "Measles",
                     "under-five deaths ": "Under_five_deaths",
                     "Total expenditure": "Total_expenditure",
                     "Diphtheria ": "Diphtheria",
                     " thinness 1-19 years": "Thinness_1-19_years",
                     " thinness 5-9 years": "Thinness_5-9_years",
                     " HIV/AIDS": "HIV/AIDS",
                     "Income composition of resources": "Income_composition_of_resources"},
          inplace = True)

# convert labels to numbers
columns = ["Status", "Country"]
for feature in columns:
    le = LabelEncoder()
    df[feature] = le.fit_transform(df[feature])

# extract Y and drop from data frame
Y = df["Life_expectancy"]
df = df.drop(["Life_expectancy"], axis=1)

Plot correlation matrix

If we have a data set with many columns, a good way to quickly check correlations among them is to visualise the correlation matrix as a heatmap. Looking at the matrix, you can see 9 columns whose correlation with the target is above 0.38. Since we are only interested in Life_expectancy, look at the bottom row for your results.

plt.figure(figsize = (24,16))
sns.heatmap(pd.concat([df, Y], axis=1).corr(), annot=True, cmap="coolwarm")

To get better results, we could choose to use only the features whose correlation with the target is above 0.3, as sketched in the snippet after figure 4.

Figure 4: Screen shot from jupyter
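
The walkthrough below keeps all the features, but if you wanted to apply that 0.3 cut-off, a minimal sketch could look like this (the threshold value and the use of the absolute correlation are assumptions, not part of the original notebook):

# minimal sketch (assumption): keep only features whose absolute correlation
# with the target exceeds 0.3
corr_with_target = pd.concat([df, Y], axis=1).corr()["Life_expectancy"].drop("Life_expectancy")
selected = corr_with_target[corr_with_target.abs() > 0.3].index
df_selected = df[selected]
print(df_selected.columns)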

Fill in values for our missing features

An important part of regression is understanding which features have missing values. We can choose to drop all rows with missing values, or fill them in with either the mode, median or mean. Below is a handy function to fill in missing values using one of these 3 methods:

  • Mode = most common value
  • Median = middle value
  • Mean = average

def fillmissing(df, feature, method):
    if method == "mode":
        df[feature] = df[feature].fillna(df[feature].mode()[0])
    elif method == "median":
        df[feature] = df[feature].fillna(df[feature].median())
    else:
        df[feature] = df[feature].fillna(df[feature].mean())

Now look for columns with missing values and fill them in using our handy function. Let's also fill in any missing values in Y.

features_missing = df.columns[df.isna().any()]
for feature in features_missing:
    fillmissing(df, feature=feature, method="mean")

Y.fillna(Y.median(), inplace=True)

Get X/Y arrays

Let’s get our dataframes into arrays we can easily manipulate.

X = df.to_numpy() 
y = Y.to_numpy().transpose()
m,n = X.shape

Normalize X

Now, let's normalise X by subtracting each feature's mean and dividing by its standard deviation, so that all features end up on a similar scale. This helps if we need to plot the data, and it also helps gradient descent converge. We apply the standardisation below, and you should see your features normalised to values similar to figure 5.
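
The standardisation applied to each feature j, with \mu_j its mean and \sigma_j its standard deviation (mu and sigma in the code below), is:

x_n^{(j)} = \frac{x^{(j)} - \mu_j}{\sigma_j}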

mu = X.mean(0)       # mean of each feature (column)
sigma = X.std(0)     # standard deviation of each feature (column)
xn = (X - mu) / sigma

Figure 5: Screen shot from jupyter

Add column of ones

Now add a column of ones to X for easier matrix manipulation of our hypothesis and cost function later on. Your data should now look as per figure 6 with a column of ones.
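
With x_0 = 1 for every row, the intercept folds into the matrix product, so the hypothesis for all rows at once becomes:

h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n, \qquad h_\theta(X) = X\theta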

xo = np.hstack((np.ones((m, 1)), xn))

Figure 6: Screen shot from jupyter

Gradient descent

Create the variables we need for gradient descent. We need the following variables:

  • repeat = number of times to repeat gradient descent
  • lrate = the learning rate used in each update
  • theta = a theta for each feature of X, plus one more for theta 0 (the intercept)

The cost of each iteration will be collected in J_history inside the gradient descent function further down.

repeat = 1000
lrate = 0.01
theta = np.zeros((n+1))

Let's define a cost function which gradient descent will use to determine the cost of each theta. The cost function implements the cost equation below.
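
Written out, the squared-error cost is:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} (X\theta - y)^\top (X\theta - y)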

def computeCost(X, y, theta):
    m = len(y)  # number of training examples
    diff = np.matmul(X, theta) - y
    J = 1 / (2 * m) * np.matmul(diff, diff)
    return J

We now go into our gradient descent loop, where we calculate a new theta on each loop and keep track of its cost. See the equation below:
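
The vectorised update applied on every iteration, with \alpha the learning rate (lrate above), is:

\theta := \theta - \frac{\alpha}{m} X^\top (X\theta - y)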

Now that we have seen the equation, let's put it into a handy function:

def gradientDescent(X, y, theta, alpha, num_iters):
    # Initialize some useful values
    m = len(y)  # number of training examples
    J_history = []
    # repeat until convergence
    for i in range(num_iters):
        hc = np.matmul(X, theta) - y
        theta -= alpha / m * np.matmul(X.transpose(), hc)
        # Save the cost J in every iteration
        J_history.append(computeCost(X, y, theta))
    return theta, J_history

Let's run gradient descent and print the results.

theta, J_history = gradientDescent(xo, y, theta, lrate, repeat)

# Display gradient descent's result
print('Best theta computed from gradient descent: ')
print(f' {theta} ')

Plot the cost of gradient descent

Plot the cost history to check that the cost is decreasing with the number of iterations. You should see the cost falling with each iteration, as in figure 7.

# Plot the convergence graph
plt.plot(np.arange(repeat), J_history, '-b', linewidth=2)
plt.xlabel('Number of iterations')
plt.ylabel('Cost J')
plt.show()

Figure 7: Screen shot from jupyter

Prediction

Let's run our prediction using the equation below.
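
With the intercept folded into \theta_0, the prediction for every row is a single matrix product:

\hat{y} = X\theta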

Take note that, by adding a column of ones to X, a single matrix multiplication performs the equation above in one easy step.

y_pred = np.matmul(xo, theta)

Evaluate predictions

Let's use the Root Mean Squared Error (RMSE), which is the square root of the mean of the squared errors. Below is the equation applied, and the result will be used later for a comparison.
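
In symbols, with \hat{y}^{(i)} denoting the prediction for row i:

\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2}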

Let's also work out each prediction as a percentage of the true result. Then use this to find the number of predicted ages that fall within 90% to 110% of the actual age. We will treat these as acceptable predictions and use them to calculate a final accuracy.

# get RMSE error rate
print('RMSE: ', np.sqrt(metrics.mean_squared_error(y, y_pred)))

# calculate our own accuracy where a prediction within 10% is ok
diff = (y_pred / y * 100)
print('Mean of results: ', diff.mean())
print('Deviation of results: ', diff.std())
print('Results within 10% support/resistance: ',
      len(np.where(np.logical_and(diff >= 90, diff <= 110))[0]) / m * 100)

You will now see the results as below. If the prediction for a row is within 10% of the actual age, we count it as a success. As such, we end up with an accuracy of 90%.

Figure 8: Screen shot from jupyter

Finally, let's visualise the accuracy of each prediction. Naturally, 100% is a perfect prediction.

plt.plot(np.arange(m), diff, '-b', linewidth=1)
plt.xlabel('Number')
plt.ylabel('Accuracy %')
plt.show()

Figure 9: Screen shot from jupyter

So, now that we have seen linear regression done with nothing but matrix manipulation, let's see how the results compare with sklearn.

Step 2: Using sklearn’s linear regression

Let's use sklearn to perform the linear regression for us. You can see it's a lot less code this time around. Once we have a prediction, we will use RMSE and our support/resistance calculation to see how our manual implementation above compares to a proven sklearn function.

from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Instantiate model
lm2 = LinearRegression()
# Fit Model
lm2.fit(X_train, y_train)
# Predict
y_pred2 = lm2.predict(X_test)
# RMSE
print('RMSE: ',np.sqrt(metrics.mean_squared_error(y_test, y_pred2)))
# calculate our own accuracy where prediction within 10% is ok
diff2 = (y_pred2 / y_test * 100)
print('Mean of results: ',diff2.mean())
print('Deviation of results: ',diff2.std())
print('Results within 10% support/resistance: ', len(np.where(np.logical_and(diff2>=90, diff2<=110))[0]) / len(y_pred2) * 100)

Figure 10: Screen shot from jupyter

Conclusion

You can see that our RMSE and support/resistance percentages were similar for both methods. Naturally, you would probably use sklearn as it's a lot less coding, but I hope this example showed you how the equations work under the hood.
