The variance-covariance matrix of fitted coefficients (Image by Author)

A Deep Dive into the Variance-Covariance Matrices of Classical Linear Regression Models

A tutorial on how the variance-covariance matrices are constructed and when to be skeptical about the data they report

Sachin Date
Towards Data Science
18 min read · Mar 28, 2022


This article covers a topic that I hope will be a joy to read for those of you who are mathematically inclined, and also for those of us who have wondered what standard errors are and where they come from.

As a quick refresher of concepts: the variance is a measure of a random variable’s “spread” or variation around its mean (a.k.a. its expected value), while the covariance measures how strongly the variations of two random variables are correlated with each other.

The standard error is simply the standard deviation of the error, where “error” has different meanings depending on whether it is used in the context of a regression variable’s coefficient or in the context of the prediction of the model as a whole. We’ll get into that soon.

The variance-covariance matrix is a square matrix i.e. it has the same number of rows and columns. The elements of the matrix that lie along its main diagonal i.e. the one that goes from top-left to bottom-right contain the variances while all other elements contain the co-variances. Thus, the variance-covariance matrix of the fitted coefficients of a regression model contains the variances of the fitted model’s coefficient estimates along its main diagonal, and it contains the pair-wise co-variances between coefficient estimates in the non-diagonal elements.

Similarly, the variance-covariance matrix of the error terms of the regression model contains the variance of each error term along its main diagonal and the covariances between all pairs of error terms in the other places.

In my previous article, we got an intuitive feel for why the fitted model’s coefficients or the error terms exhibit variances, and what role these matrices play in regression modeling.

In this article, we’ll use the Classical Linear Regression Model (CLRM) as the vehicle for showing how the formula for the variance-covariance matrix of coefficient estimates, and the variance-covariance matrix of errors, emerges seemingly naturally from the fundamental assumptions of the CLRM and its solution. Even though the material I have presented is for the linear model, the overall conceptual approach carries through to many other popularly used regression models.

This article is PART 2 of the following two part series on variance-covariance matrices:

PART 1: An Illustrated Guide To The Variance-Covariance Matrices Used In Regression Analysis
PART 2: A Deep Dive into the Variance-Covariance Matrices of Classical Linear Regression Models

There is some amount of very simple linear algebra (read: matrices and operations on matrices) that we will encounter in this article.

To begin with, we’ll quickly (re)acquaint ourselves with a few basic linear algebra concepts.

A short primer on Linear Algebra

Matrix multiplication: Two matrices A and B can be multiplied if the number of columns of A is equal to the number of rows of B. The resulting matrix has as many rows as A and as many columns as B. The following figure illustrates this product operation. For simplicity, we have dropped the column subscript of the β matrix.

Matrix multiplication of X and β (Image by Author)

The transpose of a matrix: The transpose of a matrix results from interchanging the rows and columns of the matrix. It’s as if the matrix is turned on its side. Here is the regression matrix X and its transpose X’ (pronounced X prime):

The transpose X’ of the regression matrix X (Image by Author)

Multiplication of X’ and X: The multiplication of X’ of size (k x n) with X of size (n x k) yields a matrix of size (k x k). As we shall soon see, the X’X matrix appears everywhere in regression analysis.

Let’s revisit the automobiles data set that we saw in my previous article on variance-covariance matrices:

A subset of the Automobiles data set (Source: UC Irvine)

We’ll carve out a regression matrix X containing the following 3 regression variables:

City_MPG
Engine_Size
Curb_Weight

The regression matrix X (Image by Author)

And here’s what its transpose X’ looks like:

The transpose of the regression matrix X (Image by Author)

And here is the X’X matrix for the sample of size 205 from our automobile data set. As you can see, X’X is a (3 x 3) matrix:

X’X for a sample of size 20 from the auto data set (Image by Author)

Here are a couple of useful identities associated with the transpose operation:

  • The transpose of the product of two matrices is the product of the transpose of individual matrices in reverse order:
    (AB)’ = B’A’
  • The transpose of a transposed matrix gives you back the original matrix: (X’)’ = X

Identity Matrix: The identity matrix contains 1s along the main diagonal (the one going from top-left to bottom-right), and zeros in all other elements. It is usually denoted as I_n (I subscript n) or simply I and is a square matrix of dimension (n x n). The Identity matrix is conceptually a multi-dimensional version of the number 1. I_n has the following properties:

  • I_n raised to any power equals I_n, e.g. I³ = I
  • Any matrix Z of size (m x n) multiplied by I_n yields back Z. i.e. ZI = Z
  • The inverse of I_n is I_n. This is conceptually akin to knowing that 1 divided by 1 is, well, 1.
  • Any invertible square matrix Z of size (n x n) times its inverse yields I_n. This last property brings us to the next concept — the inverse of a matrix.

Inverse of a matrix: The inverse of a matrix is conceptually the multi-dimensional equivalent of the inverse of a scalar number N. The inverse of a matrix is calculated using a somewhat complex formula and we’ll skip the formula for now. In regression analysis, the things to remember about a matrix’s inverse are the following:

  • The inverse of X is denoted as X^(-1) i.e. X with a superscript -1.
  • Only ‘invertible’ matrices have an inverse. In the scalar world, zero does not have a meaningful inverse. 1/0 = ??. It’s the same concept with matrices.
  • The inverse X^(-1) of a square matrix X, multiplied by X, yields I_n, i.e. X^(-1)X = I. Just as in the scalar world, 1/n times n is 1.
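
To make these properties concrete, here is a small numpy sketch (the matrices are made up purely for illustration) that checks the transpose identity, the shape of a product like X’X, and the behavior of the inverse and the identity matrix:

import numpy as np
#A made-up (3 x 2) matrix A and a (2 x 2) matrix B
A = np.array([[1., 2.], [3., 4.], [5., 6.]])
B = np.array([[2., 0.], [1., 3.]])
#The transpose of a product is the product of the transposes in reverse order: (AB)' = B'A'
print(np.allclose((A @ B).T, B.T @ A.T))
#A'A is a square (2 x 2) matrix, just as X'X is (k x k)
print((A.T @ A).shape)
#A square invertible matrix times its inverse yields the identity matrix
C = A.T @ A
print(np.allclose(np.linalg.inv(C) @ C, np.eye(2)))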

Now let’s apply these concepts to the topic at hand which is the derivation of covariance matrix for the linear regression model.

Formulation of the variance-covariance matrix of a regression model

We’ll start with the equation of the linear regression model as follows:

The linear regression model (Image by Author)

In the above equation, y is the dependent variable, X is the matrix of regression variables, β is the vector of regression coefficients β_1, β_2, β_3, …, β_k containing the population level values of each coefficient (including the intercept β_1), and ϵ is the vector of error terms. ϵ is the difference between the observed value of y and the modeled value of y. The error terms reflect the effect of all that which is not or cannot be modeled.

We assume that the first column of X is a column of 1s holding the place for the intercept of regression β_1.

Here’s the matrix form:

The matrix form of the linear regression model (Image by Author)

Solving (a.k.a. “fitting”) this model on a sample of size n yields estimated values of β, which we will denote as β_cap. Thus, the fitted model looks like this:

The fitted linear model (Image by Author)

In the above equation, e is the vector of residual errors (a.k.a. residuals). The residual e is the difference between the observed value of y and the value y_cap that is predicted by the fitted model.

If the model is fitted using the least squares minimization technique, known as Ordinary Least Squares (OLS), we are seeking to minimize the following sum of squares of residual errors, also called Residual Sum of Squares (RSS):

Sum of square of residuals (Image by Author)

In the above equation, x_i is the ith row of the regression matrix X.

We’ll now derive a matrix based formulation for the Residual Sum of Squares.

Let’s reexamine the vector of error terms ϵ in equation (1). For a sample of size n, ϵ is a column vector of size (n x 1):

Vector of regression errors (Image by Author)

Each ϵ_i is a random variable that has a certain mean (a.k.a. expectation) that is conditional upon X, and a certain variance around that mean that is also conditional upon X.

The column vector of conditional expectations of ϵ is represented as E(ϵ|X):

The conditional expectation of regression errors (Image by Author)

Subtracting E(ϵ|X) from ϵ yields the following:

Mean centered error terms (Image by Author)

Transposing this mean-centered column vector yields the following row vector:

Transpose of the mean centered error terms (Image by Author)

Next, let’s multiply (ϵ — E(ϵ|X))’ with (ϵ — E(ϵ|X)). Notice that (ϵ — E(ϵ|X))’ is a row vector of size (1 x n) and (ϵ — E(ϵ|X)) is a column vector of size (n x 1). Hence the product (ϵ — E(ϵ|X))’ (ϵ — E(ϵ|X)) is a (1 x 1) matrix, i.e. for our purposes, a single number. The following diagram illustrates the multiplication of (ϵ — E(ϵ|X))’ with (ϵ — E(ϵ|X)):

Solving procedure for (ϵ — E(ϵ|X))’ (ϵ — E(ϵ|X)) (Image by Author)

The third step in the above procedure is where we take the product of the two matrices shown in the second step.
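
For reference, the product works out to the following sum (a reconstruction, in this article’s notation, of the final step shown in the figure):

(ϵ - E(ϵ|X))’ (ϵ - E(ϵ|X)) = (ϵ_1 - E(ϵ_1|X))² + (ϵ_2 - E(ϵ_2|X))² + … + (ϵ_n - E(ϵ_n|X))²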

Here, we get an important insight: the summation in the final step is (n-k) times the conditional variance of the random variable ϵ. Thus, we have reached the following important result:

Conditional variance of regression errors (Image by Author)

Now, one of the chief assumptions in linear regression is the following:

For any given value of the regression variables vector x_i, the mean value of the error term does not depend on x_i.

This assumption is known as the mean independence property of the regression model. It implies that conditioned upon X, the mean of ϵ is zero. Mean independence implies that in the above formula for variance, the term E(ϵ|X) is 0 where 0 is a column matrix of size [n x 1] containing only zeroes. Hence, the above formula for error variance of ϵ reduces to the following:

The conditional variance of the error term (Image by Author)

Let’s revisit the derivation shown in equation (4). In the first step, we will apply our assumption of mean independence by substituting E(ϵ|X) with 0. Similarly, in the last step, we will substitute E(ϵ_i|X) with 0 for all i in the summation. Doing these two substitutions gets us another important formula for the sum of squares of error terms of a regression model:

Formula for the sum of squares of regression errors (Image by Author)
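
In this article’s notation, the formula reads as follows (a reconstruction of the result in the figure): the sum of squares of the error terms is simply ϵ’ϵ = ϵ_1² + ϵ_2² + … + ϵ_n², and, combining this with the earlier result, the conditional variance of the errors is ϵ’ϵ/(n-k).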

Notice that nowhere in these derivations have we assumed that the model is linear. We have only made one assumption so far, namely that of mean independence.

We will now make a second assumption:

The residual errors e of the fitted model are unbiased estimators of the error terms ϵ.

This assumption allows us to estimate the sum of squares of errors using the sum of squares of residuals. In other words:

e’e as an unbiased estimator of the sum of squares of regression errors (Image by Author)

Let’s once again revisit the linear regression model. In order to fit the model on a sample of size n using the Ordinary Least Squares (OLS) estimation technique, we need to minimize the residual sum of squares given by equation (3). We have also shown that the summation on the L.H.S. of equation (3) can be estimated using the matrix product of transpose of residual errors times itself, i.e. e’e.

We also know that the fitted model’s equation can be written as: y = Xβ_cap + e, or, in terms of e, e = (y - Xβ_cap). Thus, we need to minimize the following so as to fit the linear model using the least squares technique:

(Image by Author)

To do so, we will take the partial derivative of the above expression w.r.t. β_cap and set it to the zero matrix 0 as follows:

The partial derivative of sum of squares of residual errors w.r.t. β_cap (Image by Author)

Since β_cap is a vector of size [k x 1], the above matrix equation decomposes into k equations in k variables which can be easily solved to obtain β_cap=[β_1_cap, β_2_cap … β_k_cap]. We won’t actually solve this system of k equations. Instead we’ll directly state the end result, which is the value of β_cap that minimizes the residual sum of squares:

The least squares estimator for β (Image by Author)

The above elegant, closed-form equation gives us, in a single line, the least squares estimator for a linear regression model that is fitted using the OLS technique.
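
As a quick numerical illustration of this formula, here is a minimal numpy sketch on a made-up toy data set (the numbers carry no meaning beyond the illustration):

import numpy as np
#A toy regression matrix X with an intercept column and one regression variable
X = np.array([[1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.]])
#A made-up vector of observed values y
y = np.array([[2.1], [4.0], [6.2], [7.9], [10.1]])
#The least squares estimator: beta_cap = inverse(X'X) X'y
beta_cap = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta_cap)
#Cross-check against numpy's built-in least squares solver
print(np.linalg.lstsq(X, y, rcond=None)[0])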

Let’s recollect that our goal is to arrive at the formula for the variance-covariance matrix of the regression coefficients β. To that end, we’ll follow the approach suggested in Greene (2018) and substitute y with Xβ + ϵ in equation (9) as follows:

Substituting y with Xβ + ϵ (Image by Author)

Let’s distribute out the terms inside the blue colored bracket on the R.H.S.:

Distributing out the terms (Image by Author)

Let’s recolor the two terms as follows:

(Image by Author)

The gray bit in the first term of the above equation is the multiplication of a matrix with its inverse and it equates to the identity matrix I. And we know that Iβ = β. We’ll replace the magenta bit in the second term with the placeholder matrix A.

(Image by Author)
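
In symbols, the result is the following (reconstructing the expression in the figure above):

β_cap = (X’X)^(-1)X’(Xβ + ϵ) = β + (X’X)^(-1)X’ϵ = β + Aϵ, where A = (X’X)^(-1)X’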

Let’s remember this result.

Let’s recollect that the variance of the least squares estimate of β, namely β_cap, is what we are after.

In the earlier part of this article, we have seen that β_cap is a random variable. The variance of any random variable Z can be expressed as the expectation of the square of its mean-subtracted a.k.a. “de-meaned” value.

Thus, Var(Z) = E[(Z - Z_bar)²].

We know that the mean of β_cap is the population level value of the coefficients, β. Since β and β_cap are (k x 1) column vectors, we can concisely express the formula for variance by taking the expectation of the product of (β - β_cap) and its transpose (β - β_cap)’.

Conditional variance of β_cap (Image by Author)

The matrix (β — β_cap) is of size (k x 1) and its transpose (β — β_cap)’ is of size (1 x k) and therefore their product is a square matrix of size (k x k) which is exactly what we want in a covariance matrix.
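
Stated explicitly (a reconstruction of the expression in the figure): Var(β_cap|X) = E[(β - β_cap)(β - β_cap)’|X].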

In the above equation for variance, let’s replace β_cap with (β + Aϵ) and simplify the resulting expression as follows:

(Image by Author)
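
The simplification runs roughly as follows (a reconstruction, noting that (β - β_cap) is the negative of Aϵ and that the sign cancels when the term is multiplied by its transpose):

Var(β_cap|X) = E[(Aϵ)(Aϵ)’|X] = E[Aϵϵ’A’|X] = A E(ϵϵ’|X) A’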

Let’s inspect the last line. The center term E(ϵϵ’|X) deserves some explanation.

E(ϵϵ’|X) is the conditional expectation of ϵϵ’ given X. We know that ϵ is a column vector of size (n x 1) and so its transpose ϵ’ is a row vector of size (1 x n). Therefore, their product ϵϵ’ is a square matrix of size (n x n) as follows:

The conditional variance-covariance matrix of error terms of the model (Image by Author)

Since the mean of the error terms is assumed to be zero, the diagonal elements containing E(ϵ_i*ϵ_i|X) are actually the conditional variances of the error terms ϵ_i, and the non-diagonal elements E(ϵ_i*ϵ_j|X) are the conditional covariances of the error terms ϵ_i and ϵ_j.

We will now make two important assumptions:

Homoskedastic errors

The error terms of the linear regression model are homoskedastic, i.e. they have constant conditional variance. Thus, E(ϵ_i*ϵ_j|X) is some constant σ² when i = j.

This assumption causes all the expectation terms along the main diagonal of the variance-covariance matrix of errors to have some constant value σ².

Non-correlated errors

The error terms of the linear regression model are not correlated with each other. Thus, E(ϵ_i*ϵ_j|X) = 0 when i is not equal to j.

This assumption causes all the expectation terms that are not along the main diagonal of the variance-covariance matrix of errors to have a value of zero.

With these two assumptions, the variance-covariance matrix of errors becomes the following:

The variance-covariance matrix of error terms of the regression model under the assumptions of homoskedasticity and non-correlatedness (Image by Author)

We will now substitute E(ϵϵ’|X) with σ²I in equation (10) to get the following equation for the variance of the least squares estimator:

(Image by Author)

To simplify the above equation, we’ll simplify the gray bit on the R.H.S. using the identity (AB)’ = B’A’. We’ll also use the identities A^(-1)A = I, IA = AI = A, and (A’)’ = A while simplifying as follows:

Derivation for the variance-covariance matrix of the OLS estimator (Image by Author)
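
Written out, the derivation proceeds as follows (a reconstruction of the figure, using A = (X’X)^(-1)X’ and the fact that (X’X)^(-1) is symmetric):

Var(β_cap|X) = A(σ²I)A’ = σ²(X’X)^(-1)X’X(X’X)^(-1) = σ²(X’X)^(-1)

This final expression is the result labeled equation (11) in the figure.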

Equation (11) is one of the fundamental results in the field of statistical modeling.

As we can see, the variance-covariance matrix of the fitted regression coefficients is a function of the values of all the regression variables in the data sample and the variance of the error terms of the regression, which is assumed to be constant.

In equation (11), while X is known to us, the population level variance σ² is not a known quantity. However, the sample variance s², i.e. the variance of the residual errors of the fitted model, forms an unbiased estimate of σ². And from our earlier work on computing the variance of residual errors, particularly equations (5) and (7), we have:

The formula for the variance of the residuals (Image by Author)
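
In other words (reconstructing the formula in the figure): s² = e’e/(n-k).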

Where, as usual, e’e is the product of the transpose of the vector of residual errors of the fitted model with itself, n is the sample size and k is the number of regression variables including the intercept. You may want to verify that the equation on the R.H.S. yields a scalar quantity. The residual errors vector e is a column vector of size (n x 1) and e’ is of size (1 x n), hence, e’e is of size (1 x 1) and can be considered a scalar for the purposes of computing s².

Thus, we have arrived at the following unbiased estimate of the variance-covariance matrix of the coefficients of the fitted OLS regression model:

Estimate of the variance-covariance matrix of the fitted OLS regression model’s coefficients (Image by Author)
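
Concretely, the estimate reads as follows (a reconstruction of equation (12) in the figure): the estimated variance-covariance matrix of β_cap is s²(X’X)^(-1).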

If the model has k regression variables including the intercept, the variance-covariance matrix is a square matrix of size (k x k). This can be confirmed by noting that X is a matrix of size (n x k), and so X’ is of size (k x n), and therefore the product X’X is a square matrix of size (k x k) and so is its inverse. s² acts as a scaling factor for this matrix.

As with any variance-covariance matrix, the elements lying on the main diagonal of this matrix contain the variances of the k coefficient estimates in β_cap around their respective population values in β, while the non-diagonal elements contain the co-variances of β_cap_i with β_cap_j.

Confidence intervals of the regression coefficient’s estimates

The square root of each diagonal element of the variance-covariance matrix is the standard deviation a.k.a. the standard error of the respective regression coefficient’s estimate in β_cap.

These standard errors can be directly plugged into the formula for the (1 — α)100% confidence interval around each regression coefficient’s estimated value, as follows:

Formula for confidence interval of regression coefficients (Image by Author)

In the above formula, β_cap_i is the estimated value (a.k.a. the mean) of the ith regression coefficient of the fitted model. The t value inside the square bracket is the critical value returned from the 2-sided Student’s t-distribution with (n-k) degrees of freedom, where n is the sample size and k is the number of regression coefficients including the intercept. se_i_i is the square root of the ith diagonal element in the variance-covariance matrix of β_cap from equation (12).
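
Here is a minimal Python sketch of this confidence interval calculation (the coefficient estimate, standard error, sample size and number of coefficients below are made-up placeholders, not values from any fitted model):

import scipy.stats
#Made-up values purely for illustration
beta_cap_i = 0.05   #estimated value of the ith coefficient
se_i = 0.01         #standard error of the ith coefficient
n, k, alpha = 50, 3, 0.05
#Critical value from the 2-sided Student's t-distribution with (n-k) degrees of freedom
t_crit = scipy.stats.t.ppf(1 - alpha/2, df=n - k)
#The (1 - alpha)100% confidence interval around the estimate
print(beta_cap_i - t_crit*se_i, beta_cap_i + t_crit*se_i)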

The assumptions underlying the formulation of the variance-covariance matrix of regression coefficients

The formulation for the variance-covariance matrix presented in equation (12) has emerged from the following set of 5 assumptions that we progressively made during the derivation process:

  1. Mean-independence of errors: Error terms of the model are mean independent: E(ϵ|X)=0
  2. Linear form: The regression model has a linear form: y = Xβ + ϵ
  3. Residual errors e of the fitted model are unbiased estimates of the error terms ϵ.
  4. Homoskedastic errors: The model’s errors are homoskedastic: E(ϵ_i*ϵ_j|X) = σ² (a constant) when i = j, and,
  5. Uncorrelated errors: The model’s errors are not correlated with each other: E(ϵ_i*ϵ_j|X) =0 when i != j

When to be skeptical about the variance-covariance matrix

If the above set of assumptions are violated, then the preceding set of derivations no longer hold, and equation (12) cannot be used to compute the variance-covariance matrix of the model. If we choose to still compute the variance-covariance matrix using equation (12), the variance and covariance values in the matrix will be inaccurate. Consequently, the standard errors of the coefficients computed from the matrix will be incorrect, and we will end up reporting wrong confidence intervals for the regression coefficients.

For a given data set, the first three assumptions — mean independence, linear form and residuals as unbiased estimates of error terms — determine the viability of the linear regression model. Assumptions (4) and (5) only describe the behavior of the variances of the model’s errors. Although assumptions (1) through (5) together lead up to the formulation of the linear OLS regression model’s variance-covariance matrix, if (1) through (3) are satisfied but (4) and/or (5) are not, the linear regression model is still a viable model for the data set at hand. But in this situation, one should not report standard errors and confidence intervals of the coefficient estimates using equation (12).

When you build and fit a linear OLS model using a statistical modeling package such as statsmodels, the software will usually report standard errors using the formula given in equation (12), and it will therefore also report the (1 — α)100% confidence interval of the regression coefficients’ estimates using these standard errors. If the model is fitted using the OLS technique, assumptions (2) and (3) — linear form and residuals as unbiased estimates of error terms — are baked into the OLS technique.

It is important to verify that assumptions (1), (4) and (5), namely mean independence, homoskedasticity, and uncorrelatedness of the error terms, are satisfied by the fitted model. If they are not, one should not rely on the standard errors and confidence intervals reported by the statistical package.

As an illustration of equation (12), let’s use it to compute the variance-covariance matrix for a linear OLS model fitted on the motor vehicles data set.

First, we will fit the model on a randomly selected sample of size 50 from the overall data set. We’ll use the statsmodels package to build and fit the model as follows:

import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
#Read the automobiles dataset into a Pandas DataFrame
df = pd.read_csv('automobile_uciml_3vars.csv', header=0)
SAMPLE_SIZE = 50
#Select a random sample of size SAMPLE_SIZE
df_sample = df.sample(n=SAMPLE_SIZE)
#Setup the model expression in Patsy syntax
model_expr = 'City_MPG ~ Curb_Weight + Engine_Size'
#Carve out the X and y matrices using Patsy
y_train, X_train = dmatrices(model_expr, df_sample, return_type='dataframe')
#Build an OLS regression model using Statsmodels
olsr_model = sm.OLS(endog=y_train, exog=X_train)
#Fit the model on (y, X)
olsr_results = olsr_model.fit()
#Print the training summary of the fitted model
print(olsr_results.summary())

We see the following training output:

Training summary of the OLSR model (Image by Author)

One can see that the standard errors of coefficients reported by statsmodels are the following:

Standard errors reported by Statsmodels (Image by Author)

And their (1–0.05)100% CIs are the following:

95% confidence intervals reported by statsmodels (Image by Author)

Let’s manually calculate the variance-covariance matrix for the fitted coefficients using equation (12).

To use equation (12), we’ll need to calculate s², which is the variance of the residuals. It is calculated using the formula (e’e)/(n-k) as follows:

e = np.array(olsr_results.resid)
#reshape it to (50 x 1)
e = e.reshape(e.shape[0], 1)
e_prime = np.transpose(e)
resid_variance = np.matmul(e_prime, e)/(50-3)
print('Variance of residuals='+str(resid_variance[0][0]))

We get the following output:

Variance of residuals=25.074639949635987

The variance-covariance matrix is calculated using equation (12) as follows:

X = np.array(X_train)
X_prime = np.array(np.transpose(X))
coeff_covs_matrix = np.linalg.inv(np.matmul(X_prime, X))*resid_variance
print('Variance-Covariance Matrix of coefficient estimates=\n'+str(coeff_covs_matrix))

We get the following output:

Variance-Covariance Matrix of coefficient estimates=
[[ 1.39215336e+01 -7.68038461e-03 5.11395679e-02]
[-7.68038461e-03 5.77116102e-06 -5.62371161e-05]
[ 5.11395679e-02 -5.62371161e-05 7.23625275e-04]]

The square roots of the main diagonal elements are the standard errors of the fitted coefficients:

coeff_std_errors = np.sqrt(coeff_covs_matrix)
print('Standard errors of coefficient estimates=\n'+str(coeff_std_errors))

We see the following output:

Standard errors of coefficient estimates=
[[3.73115714e+00 nan 2.26140593e-01]
[ nan 2.40232409e-03 nan]
[2.26140593e-01 nan 2.69002839e-02]]

The main diagonal elements are:

[3.73115714, 0.00240232409, 0.0269002839]

These values match the standard errors reported by statsmodels in the training summary:

Standard errors reported by Statsmodels (Image by Author)
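
As a final cross-check, statsmodels also exposes these quantities directly on the fitted results object. The following snippet (run after the fitting code shown earlier, so that olsr_results is available) prints them out:

#Variance-covariance matrix of the coefficient estimates as computed by statsmodels
print(olsr_results.cov_params())
#Standard errors of the coefficient estimates
print(olsr_results.bse)
#95% confidence intervals of the coefficient estimates
print(olsr_results.conf_int(alpha=0.05))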

References, Citations and Copyrights

Data set

The Automobile Data Set citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Download link

Paper and Book Links

William H. Greene, Econometric Analysis, 8th Edition, 2018, Pearson

If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression, time series analysis, and forecasting.
