
Multiple Linear Regression — with math and code

Niranjan Pramanik, Ph.D.
Towards Data Science
Oct 14, 2019 · 8 min read

Linear regression is a form of predictive model that is widely used in many real-world applications. A good number of published articles on linear regression are based on a single explanatory variable, with a detailed explanation of minimizing the mean squared error (MSE) to optimize the best-fit parameters. In this article, multiple explanatory variables (independent variables) are used to derive the MSE function, and the gradient descent technique is then used to estimate the best-fit regression parameters. An example data set with three independent variables and a single dependent variable is used to build the multivariate regression model, and in the later section of the article, R code is provided to model the example data set.

Multivariate Regression Model

The equation for the simple linear regression model is familiar to most readers and is expressed as:

y = mx + c

where y is the output of the model, called the response variable, and x is the independent variable, also called the explanatory variable. m is the slope of the regression line and c denotes the intercept. Usually we have measured values of x and y and try to build a model by estimating optimal values of m and c, so that the model can be used to predict y for new values of x.

In practice, we deal with more than one independent variable, and in that case building a linear model using multiple input variables is important for accurately modeling the system and making better predictions. Therefore, multiple regression analysis is described in detail in this article. The matrix representation of the linear regression model is used to express the multivariate model compactly, and it also makes the model parameters easier to compute. I assume readers have a fundamental understanding of matrix operations and linear algebra; however, the matrix rules used in this regression analysis are provided in the last section as a refresher.

Algebraic form of Linear Regression

Let’s say we have the following data showing the scores obtained by different students in a class. The scores are given for four exams in a year, with the last column being the score obtained in the final exam. From the data, it is understood that the score in the final exam bears some relationship to the performance in the previous three exams.

Source: https://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/frames/frame.html

Assuming that the scores from the previous three exams are linearly related to the score in the final exam, our linear regression model for the first observation (first row in the table) should look like this:

152 = a×73 + b×80 + c×75 + d

where a, b, c and d are the model parameters.

The right-hand side of the equation is the regression model which, with appropriate parameters, should produce an output equal to 152. But in practice no model can be built to mimic reality perfectly; there is always an error between the model output and the true observation. Therefore, the correct regression equation is defined as below:

152 = a×73 + b×80 + c×75 + d×1 + e1

where e1 is the error of prediction for the first observation. Similarly, for the other rows in the data table, the equations can be written as:

185 = a×93 + b×88 + c×93 + d×1 + e2

180 = a×89 + b×91 + c×90 + d×1 + e3

196 = a×96 + b×98 + c×100 + d×1 + e4

………………………………………………..

………………………………………………..

192 = a×96 + b×93 + c×95 + d×1 + e25

The above equations can be written compactly with the help of four matrices, as shown below.
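Here Y collects the final-exam scores, X holds the scores from the first three exams together with a column of ones multiplying the intercept d, β is the parameter vector and e is the vector of prediction errors. For this data set of 25 students:

Y = [152, 185, 180, …, 192]ᵀ, the 25×1 vector of final scores

X = the 25×4 matrix whose rows are [73 80 75 1], [93 88 93 1], [89 91 90 1], …, [96 93 95 1]

β = [a, b, c, d]ᵀ, the 4×1 parameter vector

e = [e1, e2, e3, …, e25]ᵀ, the 25×1 error vector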

Using the above four matrices, the linear regression equation in algebraic form can be written as:

Y = Xβ + e

To obtain the right-hand side of the equation, the matrix X is multiplied by the β vector and the product is added to the error vector e. Recall that two matrices can be multiplied only if the number of columns of the first matrix equals the number of rows of the second matrix; here X has four columns and β has four rows.

Rearranging the terms, the error vector is expressed as:

e = Y - Xβ

Now it is obvious that the error e is a function of the parameters β. In the next section, the MSE in matrix form is derived and used as the objective function to optimize the model parameters.

MSE in Matrix Form

The MSE is calculated by summing the squares of the errors e over all observations and dividing the sum by the number of observations in the data table.
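In matrix form, with n denoting the number of observations (25 here) and eᵀ the transpose of the error vector, this is:

MSE = (1/n) eᵀe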

Replacing e with Y - Xβ in the equation, the MSE is rewritten as:
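MSE = (1/n) (Y - Xβ)ᵀ(Y - Xβ)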

Expanding the above equation gives:
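MSE = (1/n) (YᵀY - 2βᵀXᵀY + βᵀXᵀXβ)

which is exactly the expression computed in the R code below.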

The above equation is used as the cost function (the objective function of the optimization problem), which needs to be minimized to estimate the best-fit parameters of our regression model. The gradient is obtained by taking the derivative of the MSE function with respect to the parameter vector β and is then used in gradient descent optimization.

Gradient of MSE

As mentioned above, the gradient is expressed as:
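∇MSE = ∂(MSE)/∂β = ∂/∂β [ (1/n) (YᵀY - 2βᵀXᵀY + βᵀXᵀXβ) ]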

where ∇ is the differential operator used for the gradient. Using matrix differentiation rules, we get the following equation.
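∇MSE = (1/n) (2XᵀXβ - 2XᵀY) = (2/n) (XᵀXβ - XᵀY)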

The above gradient (referred to here as the Jacobian) is used in gradient descent optimization, along with a learning rate (lr), to update the model parameters.

Gradient Descent Method

The formula used by the gradient descent method to update the model parameters is shown below.
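βnew = βold - lr × ∇MSE = βold - lr × (2/n) (XᵀXβold - XᵀY)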

βold is the initialized parameter vector, which gets updated in each iteration; at the end of each iteration βold is set equal to βnew. lr is the learning rate, which represents the step size and helps prevent overshooting the lowest point of the error surface. The iterations continue until the MSE value stops decreasing and becomes flat.

Illustration of gradient descent method. Source: http://www.claudiobellei.com/2018/01/06/backprop-word2vec/

Example Data

In this section, a multivariate regression model is developed using the example data set. The gradient descent method is applied to estimate the model parameters a, b, c and d. The values of the matrices X and Y are known from the data, whereas the β vector is unknown and needs to be estimated. The MSE and the gradient of the MSE are computed first, and then the gradient descent method is applied to minimize the MSE.

R-code

Read data and initialize β:

dataLR <- read.csv("C:\\Users\\Niranjan\\Downloads\\mlr03.csv", header = T)  ## read the exam scores data
beta <- c(0, 0, 0, 0)                    ## beta initialized to zeros
beta_T <- t(beta)
X <- matrix(NA, nrow(dataLR), ncol = 4)  ## design matrix: three exam scores plus intercept column
X[,1] <- dataLR$EXAM1
X[,2] <- dataLR$EXAM2
X[,3] <- dataLR$EXAM3
X[,4] <- 1                               ## column of ones for the intercept d
XT <- t(X)
y <- as.vector(dataLR$FINAL)             ## response vector (final exam scores)
yT <- t(y)

Compute MSE and update β

mse <- (1/nrow(dataLR)) * (yT%*%y - 2 * beta_T%*%XT%*%y + beta_T%*%XT%*%X%*%beta)   ## MSE = (1/n)(YᵀY - 2βᵀXᵀY + βᵀXᵀXβ)
betanew <- beta - (lr * (2/nrow(dataLR)) * (XT%*%X%*%beta - XT%*%y))                ## βnew = β - lr(2/n)(XᵀXβ - Xᵀy); lr is defined in the complete code below

Complete code for parameter estimation

##multivariate linear regression
dataLR <- read.csv("C:\\Users\\Niranjan\\Downloads\\mlr03.csv", header = T)
beta <- c(0, 0, 0, 0)            ## initialize parameter vector (a, b, c, d)
beta_T <- t(beta)
X <- matrix(NA, nrow(dataLR), ncol = 4)
X[,1] <- dataLR$EXAM1
X[,2] <- dataLR$EXAM2
X[,3] <- dataLR$EXAM3
X[,4] <- 1                       ## intercept column
XT <- t(X)
y <- as.vector(dataLR$FINAL)
yT <- t(y)
iteration <- 1
lr <- 0.00001                    ## learning rate (step size)
msef <- NULL                     ## stores the MSE of each iteration
while (iteration < 10) {
  mse <- (1/nrow(dataLR)) * (yT%*%y - 2 * beta_T%*%XT%*%y + beta_T%*%XT%*%X%*%beta)   ## MSE in matrix form
  betanew <- beta - (lr * (2/nrow(dataLR)) * (XT%*%X%*%beta - XT%*%y))                ## gradient descent update
  msef <- rbind(msef, mse)
  beta <- betanew
  beta_T <- t(betanew)
  iteration <- iteration + 1
}
plot(1:length(msef), msef, type = "l", lwd = 2, col = 'red', xlab = 'Iterations', ylab = 'MSE')
grid(nx = 10, ny = 10)
print(list(a = beta[1], b = beta[2], c = beta[3], d = beta[4]))
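As a quick sanity check (not part of the gradient descent procedure itself), setting the gradient ∇MSE to zero gives the familiar closed-form least squares condition XᵀXβ = XᵀY, which can be solved directly in R and compared with the values found by the loop above:

## closed-form least squares estimate for comparison with gradient descent
beta_ols <- solve(XT %*% X, XT %*% y)
print(beta_ols)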

Code for plotting output

library(plot3D)
ymod <- X%*%beta   ## model-predicted final scores
## modeled output (triangles) colored by the predicted final score
scatter3D(dataLR$EXAM1, dataLR$EXAM2, dataLR$EXAM3, colvar = ymod,
          pch = 17, cex = 2, bty = "g", ticktype = "detailed", phi = 0, lwd = 2.5,
          xlab = "Exam1", ylab = 'Exam2', zlab = 'Exam3')
## observed final scores (circles) added to the same axes
scatter3D(dataLR$EXAM1, dataLR$EXAM2, dataLR$EXAM3, colvar = dataLR$FINAL,
          pch = 16, cex = 2, bty = "g", ticktype = "detailed", phi = 0, lwd = 2.5,
          xlab = "Exam1", ylab = 'Exam2', zlab = 'Exam3', add = T)
## model vs. data scatter plot with the best fit (1:1) line
plot(dataLR$FINAL, ymod, pch = 16, cex = 2, xlab = 'Data', ylab = 'Model')
lines(ymod, ymod, lwd = 4, col = "green", lty = 6)
grid(nx = 10, ny = 10)
legend("topleft", c('Model-Data Points','Best fit line'), lty = c(NA,6), lwd = c(NA,4), col = c("black","green"), pch = c(16,NA))

Output

The value of the MSE decreases drastically, and after six iterations it becomes almost flat, as shown in the plot below. The corresponding model parameters are the best-fit values.

Minimizing MSE: change in MSE with iterations

Optimized β: the optimized model parameters a, b, c and d

The computed final scores are compared with the final scores from the data. Model efficiency is visualized by comparing the modeled output with the target output in the data, and the coefficient of determination is estimated to be 0.978 to numerically assess the performance of the model (a short snippet showing one way to compute it is given after the plots). The plots below show the comparison between model and data, where the three axes express the explanatory variables Exam1, Exam2 and Exam3 and the color scheme shows the output variable, i.e. the final score.

Visualization of model output and target in the data
Comparison between model output and target in the data
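A minimal sketch of how the coefficient of determination can be computed, assuming the standard definition R² = 1 - SSres/SStot and the objects X, beta and dataLR created by the code above:

ymod <- X %*% beta                                    ## model-predicted final scores
ss_res <- sum((dataLR$FINAL - ymod)^2)                ## residual sum of squares
ss_tot <- sum((dataLR$FINAL - mean(dataLR$FINAL))^2)  ## total sum of squares
rsq <- 1 - ss_res/ss_tot                              ## reported above as about 0.978
print(rsq)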

Basic Matrix Rules
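The derivation of the gradient above relies on a handful of standard matrix identities; the ones needed here are:

(A + B)ᵀ = Aᵀ + Bᵀ

(AB)ᵀ = BᵀAᵀ

∂(aᵀβ)/∂β = a, for a constant vector a (used for the term βᵀXᵀY)

∂(βᵀAβ)/∂β = 2Aβ, for a symmetric matrix A such as XᵀX (used for the term βᵀXᵀXβ)

Also, a 1×1 product such as YᵀXβ is a scalar and therefore equals its own transpose βᵀXᵀY, which is why the two cross terms combine into -2βᵀXᵀY in the expanded MSE.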
