
Introduction
For most people, simple linear regression is the natural starting point for understanding model-based estimation and model comparison. Whether you’re taking an introductory statistics or data science course, you can bet your bottom dollar that linear regression will crop up at some point. And there’s a good reason for it.
Simple linear regression provides a natural extension of simple descriptive statistics by modeling the response variable as a linear combination of only a single predictor. This simplicity not only facilitates the interpretation of model parameters but also makes estimation via ordinary least squares (OLS) easier to grasp.
While most textbook introductions provide a detailed mathematical treatment alongside the broader conceptual elements, when it comes to actually implementing the model it’s almost never done from first principles. No matter which language you’re using, there’ll almost certainly be a convenience function that fits the model for you. And why wouldn’t you use it? You don’t have to do all those tedious calculations by hand! That’s definitely a plus, though I’m a firm believer that taking some time to familiarise yourself with a model’s statistical machinery is an essential part of becoming an effective analyst and data scientist.
In an earlier post, I provided a primer on Linear Algebra that touched on some fundamental principles and operations. In this post, we’re going to build on those concepts by looking under the hood and getting our hands dirty with the matrix operations that underpin linear regression.
Simple Linear Regression in Matrix Form
Most will be familiar with the standard regression formula that models a response variable Y as a linear combination of a single predictor X:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)$$
where I’ve adopted the convention of assuming the errors are normally distributed. From here we’ll build out the matrix representation by assigning the elements to vectors and matrices.
First, we’ll put all the responses in an n-dimensional vector called the response vector:
$$\underset{n \times 1}{\mathbf{Y}} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}$$
I’ve been quite explicit here by including the size of the vector – which is really just a column matrix – so we can keep track of things. However, it’s perfectly reasonable to use a lowercase boldface y if you prefer.
Next, the predictor X is placed into an n × p matrix called the design matrix:
$$\underset{n \times p}{\mathbf{X}} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}$$
where p denotes the number of columns and corresponds to the number of coefficients in the model. Note that the first column contains only ones – we’ll discuss why shortly – which accommodates the intercept, a constant term. The number of columns in the design matrix is therefore always one greater than the number of predictors. In the example above we only have a single predictor, meaning we need to estimate an intercept and a slope; so, p = 2 in this case.
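To make the construction concrete, here’s a minimal NumPy sketch – using a made-up predictor of my own, not anything from the derivation above – that builds a design matrix by prepending a column of ones to a single predictor:

```python
import numpy as np

# A made-up predictor with n = 5 observations (purely illustrative).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Build the n x p design matrix: a column of ones for the intercept,
# followed by the predictor, so p = 2 here.
X = np.column_stack([np.ones_like(x), x])

print(X.shape)  # (5, 2)
```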
The regression coefficients are also placed into a p × 1 vector called the parameter vector:
$$\underset{p \times 1}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$
Again, p denotes the number of parameters, though here it gives the number of rows, unlike the design matrix where p is the column dimension. This arrangement is important because we’ll need to multiply these two objects together to compute the linear predictor.
However, before we do that, there is one last thing to do – place all the error terms into an n × 1 vector:
$$\underset{n \times 1}{\boldsymbol{\varepsilon}} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
With all of these in place, we can now express the simple Linear Regression model using matrix notation like so:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
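If you’d like to see the matrix form in action, here’s a small NumPy sketch – with made-up values for β and the error standard deviation – that simulates a response directly from Y = Xβ + ε:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 100
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])   # n x p design matrix

beta = np.array([2.0, 0.5])            # illustrative intercept and slope
eps = rng.normal(0.0, 1.0, size=n)     # normally distributed errors

# The model in matrix form: Y = X @ beta + eps
Y = X @ beta + eps
print(Y.shape)  # (100,)
```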
The Linear Predictor
In words, the matrix formulation of the linear regression model is the product of the matrices X and β plus an error vector. The product of X and β is an n × 1 matrix called the linear predictor, which I’ll denote η:
$$\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta}$$
Now, matrix multiplication works a little differently than you might expect. I covered this in my primer on linear algebra – so if you haven’t checked it out you can find it here – but I’ll quickly run through the details now.
If we have some m × q matrix A and another q × r matrix B, then the product is an m × r matrix C (note how the q dimension drops out from the result). The reason for this change in size is because the element located at row i and column j in C is the dot product of the _i_th row in A and the _j_th column in B:
$$C_{ij} = \sum_{k=1}^{q} A_{ik} B_{kj}$$
So, because the dot product takes the sum across the q dimension, this drops out from the resulting matrix.
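As a quick sanity check, the snippet below (with arbitrary example matrices of my own choosing) confirms that an element of the product really is the dot product of the corresponding row of A and column of B:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # m x q
B = rng.normal(size=(4, 2))   # q x r

C = A @ B                     # m x r; the q dimension drops out

# Element (i, j) is the dot product of row i of A and column j of B.
i, j = 1, 0
manual = np.dot(A[i, :], B[:, j])
print(np.isclose(C[i, j], manual))  # True
```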
For the simple linear regression case, the product is between an n × p matrix X and a p × 1 matrix β, which therefore results in an n × 1 matrix η. Following on from above, the (i, j) element in η is computed using the following dot product:
$$\eta_{ij} = \sum_{k=1}^{p} X_{ik}\, \beta_{kj}$$
Here the sum is taken over p, which we know is the number of coefficients in the model, and so p = 2.
If we then substitute the dot product into the linear predictor vector and plug in the values from the design matrix – including the ones in its first column – we get the following (because j = 1, I’m going to drop that subscript to declutter the notation a little):
$$\boldsymbol{\eta} = \begin{bmatrix} \eta_1 \\ \eta_2 \\ \vdots \\ \eta_n \end{bmatrix} = \begin{bmatrix} 1 \cdot \beta_0 + X_1 \beta_1 \\ 1 \cdot \beta_0 + X_2 \beta_1 \\ \vdots \\ 1 \cdot \beta_0 + X_n \beta_1 \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 X_1 \\ \beta_0 + \beta_1 X_2 \\ \vdots \\ \beta_0 + \beta_1 X_n \end{bmatrix}$$
All of this reduces to something very familiar indeed: each element in η is just our linear regression equation applied to each value of X! Hopefully, you can see why the column of ones is included in the design matrix. This ensures that the intercept is added to each observation.
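To verify that claim numerically, here’s a short NumPy sketch (again with made-up values for the predictor and coefficients) showing that Xβ returns exactly the regression equation applied to each observation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])   # column of ones + predictor
beta = np.array([2.0, 0.5])                 # illustrative beta_0 and beta_1

eta = X @ beta                              # linear predictor, shape (4,)

# Identical to applying the regression equation element-wise.
print(np.allclose(eta, beta[0] + beta[1] * x))  # True
```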
Model Errors
There are three critical assumptions relating to the error terms – or perturbations – around the linear predictor that underpin the Gauss-Markov theorem. The first is that the conditional mean of the errors is zero, implying that the average error should not depend on any particular value of X. This is called the zero conditional mean assumption:
$$\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$$
Related to this is the homoscedasticity assumption, which states that the variance of the errors should not be affected by the values of the independent variables. That is, the spread of the errors should not depend on any information contained within the design matrix:
$$\operatorname{Var}(\varepsilon_i \mid \mathbf{X}) = \sigma^2$$
The final assumption is the no autocorrelation assumption, which requires that the error terms are uncorrelated. Knowing one error term should confer no information about another, so the errors should not co-vary:
$$\operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid \mathbf{X}) = 0, \qquad i \neq j$$
Under these assumptions, the covariance matrix for the error terms is a scalar matrix and the errors are considered to be spherical:
$$\operatorname{Var}(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n$$
Just a note before moving on: while I adopted the convention of normally distributed errors, the Gauss-Markov theorem does not require the error terms to be normal, nor does it require that they are independent and identically distributed; only that they are homoscedastic and uncorrelated. This means the errors can exhibit dependencies, but so long as those dependencies are not linear – which is what correlation measures – parameter estimation can proceed safely.
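If you’d like to see what spherical errors look like in practice, the quick simulation below (my own illustration, not part of the original derivation, with an arbitrary choice of σ) draws many replicates of an uncorrelated, equal-variance error vector and shows that their sample covariance matrix is approximately σ²I:

```python
import numpy as np

rng = np.random.default_rng(1)

n, sigma = 4, 2.0
reps = 200_000

# Many independent draws of an n-dimensional error vector with
# uncorrelated, equal-variance components.
eps = rng.normal(0.0, sigma, size=(reps, n))

# The sample covariance matrix is approximately sigma^2 * I (spherical errors).
print(np.round(np.cov(eps, rowvar=False), 2))
```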
Parameter Estimation via Ordinary Least Squares
When fitting a linear regression model to data, the goal is to estimate the unknown model coefficients contained in β. The usual approach to finding the best values for these unknowns is to satisfy the least squares criterion, which aims to minimize the total squared error between the linear predictor and the response. I’m going to step through how this is implemented for the simple linear regression model, though I’ll move through this section fairly quickly.
In matrix form, the vector of errors, or residuals, is defined as follows:
$$\mathbf{e} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$
where the hat on top of β denotes the estimated coefficients. The sum of the squared residuals can then be written as the dot product of the error vector with itself:
$$\mathrm{SSE} = \mathbf{e}^{\mathsf{T}}\mathbf{e}$$
where T denotes the transpose operator¹. To derive the least squares criterion it’s convenient to write out the errors in full as follows:
$$\mathbf{e}^{\mathsf{T}}\mathbf{e} = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\mathsf{T}}(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{Y}^{\mathsf{T}}\mathbf{Y} - 2\hat{\boldsymbol{\beta}}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{Y} + \hat{\boldsymbol{\beta}}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\hat{\boldsymbol{\beta}}$$
The idea, then, is to find the parameters that minimize this value. To do so, we need to take the derivative of the above with respect to the vector β and set this to zero:
$$\frac{\partial\, \mathbf{e}^{\mathsf{T}}\mathbf{e}}{\partial \hat{\boldsymbol{\beta}}} = -2\mathbf{X}^{\mathsf{T}}\mathbf{Y} + 2\mathbf{X}^{\mathsf{T}}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}$$
from which the normal equation can be derived:
$$\mathbf{X}^{\mathsf{T}}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^{\mathsf{T}}\mathbf{Y}$$
From here, we can find a solution for our unknown parameters by multiplying each side of the normal equation by the inverse of _X_ᵀX. I cover matrix inversion in this post, but in short, a matrix A is invertible if there exists a matrix B such that their product returns the identity matrix, I.
For a simple linear regression model, _X_ᵀX is square with a size of 2 × 2, though more generally the matrix will be a p × p matrix. We then need to find another 2 × 2 matrix that is the multiplicative inverse of _X_ᵀX. If such a matrix does not exist then the equation cannot be solved, though if _X_ᵀX is indeed invertible, then we can obtain the parameter vector b as follows:
$$\mathbf{b} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{Y}$$
Here you can see why it is necessary that the design matrix is not rank deficient. If there are linear dependencies between its columns, then _X_ᵀX cannot be inverted and a unique solution cannot be found². This is exactly what happens when you have perfect multicollinearity.
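To make this concrete, here’s a minimal NumPy sketch – using simulated data with made-up true coefficients – that solves the normal equations for b and checks the result against NumPy’s built-in least squares routine:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)   # illustrative data

X = np.column_stack([np.ones(n), x])

# b = (X^T X)^{-1} X^T y -- solve() is preferred over an explicit inverse.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Compare against NumPy's least squares routine.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b, b_lstsq))  # True
print(b)                        # roughly [2.0, 0.5]
```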
A Closer Look at Parameter Estimation
It is interesting to further consider the elements contained in each matrix. First, let’s look at the cross-product of the design matrix _X_ᵀX:
$$\mathbf{X}^{\mathsf{T}}\mathbf{X} = \begin{bmatrix} n & \sum_{i} X_i \\ \sum_{i} X_i & \sum_{i} X_i^2 \end{bmatrix}$$
From here we can see that the matrix contains the sums of products of the design matrix’s columns. But what we need is the inverse of this matrix. I won’t go through how to derive the inverse, but it looks like this:
$$(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1} = \frac{1}{n\sum_{i} X_i^2 - \left(\sum_{i} X_i\right)^2} \begin{bmatrix} \sum_{i} X_i^2 & -\sum_{i} X_i \\ -\sum_{i} X_i & n \end{bmatrix}$$
Finally, we also need the cross-product of the design matrix and the response vector Y, which produces the following:
$$\mathbf{X}^{\mathsf{T}}\mathbf{Y} = \begin{bmatrix} \sum_{i} Y_i \\ \sum_{i} X_i Y_i \end{bmatrix}$$
With the matrices written out in full, we can substitute them into the estimation equation and work through the calculation like so:
$$\begin{aligned} \mathbf{b} &= (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{Y} \\ &= \frac{1}{n\sum_{i} X_i^2 - \left(\sum_{i} X_i\right)^2} \begin{bmatrix} \sum_{i} X_i^2 & -\sum_{i} X_i \\ -\sum_{i} X_i & n \end{bmatrix} \begin{bmatrix} \sum_{i} Y_i \\ \sum_{i} X_i Y_i \end{bmatrix} \\ &= \frac{1}{n\sum_{i} X_i^2 - \left(\sum_{i} X_i\right)^2} \begin{bmatrix} \sum_{i} X_i^2 \sum_{i} Y_i - \sum_{i} X_i \sum_{i} X_i Y_i \\ n\sum_{i} X_i Y_i - \sum_{i} X_i \sum_{i} Y_i \end{bmatrix} \end{aligned}$$
To be fair, there is a bit going on in that derivation, but it’s really the last line that is most interesting. What this all boils down to is something quite convenient; we can estimate the slope coefficient using the sample covariance and variance like so:
$$b_1 = \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i} (X_i - \bar{X})^2} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$$
Once we have that, we can use this estimate, along with the means of y and x, to derive the estimate for the intercept:
$$b_0 = \bar{Y} - b_1 \bar{X}$$
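Here’s a quick numerical check – on simulated data of my own, not anything from the post – that this covariance-and-variance shortcut agrees with the full matrix solution:

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, size=100)   # illustrative data

# Slope from the sample covariance and variance, intercept from the means.
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# Matrix-based solution for comparison.
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose([b0, b1], b))  # True
```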
Fitted Values
We have now used some linear algebra to find the best-fitting parameters for a simple linear regression model. Now that we have these in hand the next step is to see how well the fitted values correspond with the values contained in the response vector.
All we need to obtain the fitted values is the design matrix along with the parameter vector b. Multiplying these together gives the fitted values:
$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b}$$
Note that our fitted values have a hat placed on top of them to indicate that they’re estimated values. For this reason, an alternative way to express the fitted values is as a combination of the response vector and the hat matrix, which gets its name from the fact that it puts a hat on Y.
To see this, let’s write out the equation for the fitted values by substituting b with the full equation for the parameter vector:
$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{Y}$$
Basically, every term that involves the design matrix X is lumped together to form the definition of the hat matrix:
$$\mathbf{H} = \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}, \qquad \hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$$
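As a final sanity check, the sketch below (again on simulated data with made-up coefficients) confirms that multiplying Y by the hat matrix reproduces the fitted values obtained from Xb:

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=50)   # illustrative data
X = np.column_stack([np.ones_like(x), x])

# Fitted values via the estimated coefficients...
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

# ...and via the hat matrix H = X (X^T X)^{-1} X^T, which "puts a hat on" Y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(y_hat, H @ y))  # True
```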
Final Remarks
If you’ve made it to this point, thanks for sticking at it. There was a lot to unpack, for sure. However, I hope you can see how the principles of matrix algebra are used to build simple linear regression models. Now, while I have focussed on simple linear regression throughout to keep the examples concise, everything discussed here works the same way for multiple regression, too. All that changes is that the dimensionality of the matrices and vectors increases.
Footnotes
[1] The notation I have used here is different from what I used in my primer article, but they mean exactly the same thing. The reason for the change is to keep in step with how the matrix formulations are typically described.
[2] Strictly speaking, a singular matrix does not have a regular inverse, but the generalised inverse extends the idea of inversion to a broader class of matrices. The generalised inverse is particularly useful for finding least squares solutions in cases where no unique solution exists.
Thanks for reading!
If you enjoyed this post and would like to stay up to date then please consider following me on Medium. This will ensure you don’t miss out on any new content.
To get unlimited access to all content consider signing up for a Medium subscription.
You can also follow me on Twitter, LinkedIn, or check out my GitHub if that’s more your thing.