Image by Clker-Free-Vector-Images from Pixabay (Pixabay license)

A Deep-Dive into Generalized Least Squares Estimation

A detailed look at how to fit a robust GLS model on heteroskedastic, auto-correlated data sets

Generalized Least Squares (GLS) estimation is a generalization of the Ordinary Least Squares (OLS) estimation technique. GLS is especially suitable for fitting linear models on data sets that exhibit heteroskedasticity (i.e., non-constant variance) and/or auto-correlation. Real-world data sets often exhibit these characteristics, making GLS a very useful alternative to OLS estimation.

Motivation for the GLS Estimator

When we use the Ordinary Least Squares (OLS) estimation technique for fitting a linear model on a data set, we make two crucial assumptions:

  1. The variance of the errors of the regression model is constant i.e., the errors are homoskedastic, and
  2. The errors are not correlated with each other, i.e., there is no auto-correlation among the errors.

But in real world data sets, one or both of these assumptions often don’t hold true. For instance, consider the following model of county-level poverty in the US:

A linear model for estimating county-level poverty (Image by Author)

For a given data set, a statistical package such as statsmodels can be used for estimating this model. An OLS estimation of this linear model using statsmodels on the 2015–2019 American Community Survey (ACS) 5-Year Estimates data from the US Census Bureau (see terms of use at the bottom of the article) yields the following fitted model:

An estimation of county-level poverty using OLS (Image by Author)

Here, e_i is the residual error of regression.
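If you would like to reproduce this step, here is a minimal sketch of the OLS fit using statsmodels. The file name and the regressor names below are placeholders, not the actual columns used in the fitted model above; substitute your own:

import pandas as pd
import statsmodels.formula.api as smf

# Load the county-level ACS data set (the file name is hypothetical).
df = pd.read_csv('county_level_acs_2015_2019.csv')

# Regress the county-level poverty rate on the explanatory variables.
# Replace the placeholder regressors with the actual columns in your data.
model_expr = ('Percent_Households_Below_Poverty_Level ~ '
              'Regressor_1 + Regressor_2')
olsr_model = smf.ols(formula=model_expr, data=df)
olsr_results = olsr_model.fit()

print(olsr_results.summary())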

Here’s the model training summary generated by Statsmodels showing, among other things, that all coefficients are statistically significant at a p < .001:

Training summary of the linear model (Image by Author)

At face-value, this fitted model seems sound. But a plot of the residual errors e_i against the corresponding predicted values of Percent_Households_Below_Poverty_Level reveals a serious problem:

Plot of the fitted model’s residual errors against the estimated value of the response variable. The red lines are shown just to illustrate the increasing pattern of variance in the model’s residuals (Image by Author)

The model’s errors (as estimated by the fitted model’s residuals) are clearly heteroskedastic. In this case the error variance increases as a function of y_cap.

And that should make us suspect the standard errors, p-values and confidence intervals reported by the estimation software.
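Here is a rough sketch of how such a residuals-versus-predictions plot, along with a Breusch-Pagan test for heteroskedasticity, can be produced with statsmodels and Matplotlib, assuming the olsr_results object from the earlier sketch:

import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

# Scatter the residuals against the model's predicted values.
plt.scatter(olsr_results.fittedvalues, olsr_results.resid, s=4)
plt.xlabel('Predicted Percent_Households_Below_Poverty_Level')
plt.ylabel('Residual error')
plt.show()

# Breusch-Pagan test: a small p-value is evidence of heteroskedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    olsr_results.resid, olsr_results.model.exog)
print('Breusch-Pagan LM test p-value:', lm_pvalue)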

Here’s what the underlying problem is.

The problem with the OLS estimator

When we use the OLS estimator, we assume that the model’s errors are homoskedastic and uncorrelated. Therefore, the formula we use for estimating the variance of the estimated coefficients also makes the same assumptions. But as we saw in the above example, if the errors turn out to be heteroskedastic and/or correlated, the variance formula outputs incorrect values for the variance of the estimated coefficients. An incorrect estimation of variance leads to downstream errors in the following measures:

  • Incorrect standard errors of the coefficients (standard errors are the square root of the variances).
  • Incorrect z-scores of the coefficient estimates (z-scores of the coefficients are inversely related to the standard errors).
  • Incorrect p-values of the coefficient estimates. Incorrectly calculated p-values may cause some coefficients to be incorrectly reported as statistically significant (and vice-versa).
  • Finally, incorrect confidence intervals for the coefficient estimates (the chain from variance to confidence interval is sketched below).
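To see why an incorrectly estimated variance cascades through the entire inference chain, recall how these quantities are computed for the jth coefficient (using the z-based, large-sample inference referred to above):

SE(β_cap_j) = √Var(β_cap_j)
z_j = β_cap_j / SE(β_cap_j)
p_j = 2·(1 − Φ(|z_j|)), where Φ is the standard normal CDF
95% CI: β_cap_j ± 1.96·SE(β_cap_j)

An over- or under-estimated Var(β_cap_j) therefore distorts every one of these downstream quantities.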

Overall, the inference procedure yields incorrect results even though on the face of it there seems absolutely nothing wrong with the fitted model.

It can be shown that when the model’s errors are heteroskedastic and/or correlated, the OLS estimator, while still being consistent and unbiased, no longer yields the lowest possible variance estimates for the model’s coefficients. It’s at least theoretically possible to design another estimator that will produce coefficient estimates having a lower variance and thus higher precision.

In short, the OLS estimator is no longer efficient.

The remedy

A common way to deal with heteroskedastic errors (although not as much for correlated errors) is to use what’s known as White’s Heteroskedasticity Consistent (HC) estimator. I have covered this estimator in detail in my article on “Introducing the White’s Heteroskedasticity Consistent Estimator”.

But White’s HC estimator has a few drawbacks, chief among them being that it factors in only heteroskedasticity and not correlation among the errors. A second issue is that in small samples, the White HC estimator can underestimate the variance of the coefficient estimates, thereby leading to the same issues that the OLS estimator has.
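For reference, statsmodels can report White-style heteroskedasticity-consistent standard errors without changing the coefficient estimates themselves. A one-line sketch, assuming the olsr_model object from the earlier snippet:

# Refit the same model, asking for heteroskedasticity-consistent (HC)
# standard errors. 'HC0' is White's original estimator; 'HC1' through
# 'HC3' apply small-sample corrections.
hc_results = olsr_model.fit(cov_type='HC0')
print(hc_results.summary())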

Getting to the GLS estimator

A more direct approach to dealing with both heteroskedastic and/or correlated errors is to follow this two-point plan:

  1. Use OLS to fit the linear model on the data set. Using the fitted model’s residual errors as a proxy for the linear model’s errors, create a model of the heteroskedasticity and/or correlation observed in those residuals.
  2. Design an estimator that uses these modeled values of variance and correlation in its estimation technique.

Such an estimator will be robust to both heteroskedasticity and correlation in the model’s error term. This is exactly the approach taken by the Generalized Least Squares (GLS) estimator.

Development of the GLS technique

In this section, we’ll develop the GLS estimator from first principles, and we’ll see how to use it.

Let’s start with the following linear model:

A linear model (Image by Author)

y is the response variable. For a data set of size n, y is a column vector of size [n x 1]. Suppose the model has k regression variables including the intercept. β is a column vector of regression coefficients [β_1, β_2,…,β_k] where β_1 is the intercept. X is the matrix of regression variables including the placeholder for the intercept in column 1 of the matrix. X is of size [n x k].

The ith sample in the data set is the tuple: (y_i, x_i) where y_i is a scalar (a pure number), and x_i is a row vector of size [1 x k].

Typically, a regression model helps us “explain” some of the variance in the response variable y. What the model cannot explain “leaks” into the error term ϵ of the model. Just like y, ϵ is a column vector of size [n x 1].

Here’s the matrix form of Eq (1):

A linear model (Image by Author)

It can be shown that an Ordinary Least-Squares (OLS) estimation of the coefficients vector β of this model yields the following estimator:

The vector of estimated coefficients, estimated using ordinary least squares (Image by Author)

In the above equation, X' is the transpose of X. A matrix transpose operation essentially flips the matrix along its main diagonal i.e. the one that extends from the top-left to the bottom right. The transpose operation conceptually turns a matrix on its side. Since X is of size [n x k], its transpose X' has a size [k x n].

Another kind of matrix operation we see in Eq (2) is the (-1) in the superscript, which denotes the matrix inverse. A matrix inverse is the matrix equivalent of taking the reciprocal of a number.
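In plain notation, Eq (2) reads as follows:

β_cap = (X'X)^(-1) X'y

Since (X'X)^(-1) is of size [k x k] and X'y is of size [k x 1], β_cap comes out as a [k x 1] column vector of coefficient estimates.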

Substituting Eq (1) in Eq (2), we get the following result:

Substituting Eq (1) in Eq (2) (Image by Author)

After simplifying the above result a bit (I have detailed the simplification in this article), we get the following useful formula for the coefficient estimates. The following result brings out the effect of the error term ϵ on the values of the coefficients estimated by OLS:

The OLS estimator of β as a function of X and ϵ (Image by Author)
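In plain notation, Eq (3) reads as follows, with A used as shorthand for (X'X)^(-1)X':

β_cap = (X'X)^(-1) X'(Xβ + ϵ) = β + (X'X)^(-1) X'ϵ = β + Aϵ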

Since β_cap is the estimated value of β, β_cap is a random variable, and it has a mean and a variance.

It can be shown that the mean (a.k.a. expected) value of β_cap is the population-level value β. Specifically, the expectation of β_cap conditioned upon X is β.

The variance of β_cap conditioned upon X is denoted by Var(β_cap|X). To calculate Var(β_cap|X), we employ the following formula for (conditional) variance:

The formula for conditional variance of a matrix random variable Z expressed in terms of Z, its mean (expected) value E(Z) and the transpose of the demeaned Z (Image by Author)
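In plain notation, this formula reads:

Var(Z|X) = E[ (Z − E(Z)) (Z − E(Z))' | X ]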

Substituting Z = β_cap, and E(Z) = β, we have:

The conditional variance of the coefficient estimates (Image by Author)

On the R.H.S. of the above equation, we substitute β_cap with β + Aϵ from Eq (3), and after a considerable amount of simplification (details of which are covered here), we arrive at the following formula for the variance of the estimated coefficients:

The formula for variance of the OLS estimated coefficients of a linear regression model (Image by Author)
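Written out, Eq (4) is the following 'sandwich' expression, with the term E[(ϵϵ')|X] sitting at its center:

Var(β_cap|X) = (X'X)^(-1) X' E[(ϵϵ')|X] X (X'X)^(-1)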

The central term (colored in blue) deserves some attention.

Recollect that ϵ is the error term of the model. Therefore, by definition, ϵ is a random variable. ϵ is a matrix of size [n x 1], and ϵ’ is of size [1 x n]. Thus, by the rules of matrix multiplication, ϵϵ’ is a matrix of size [n x n]. ϵϵ’ is also a random variable.

E[(ϵϵ’)|X] is the expectation of the random variable (ϵϵ’) conditioned upon X. E[(ϵϵ’)|X] is a square matrix of size [n x n] and it looks like this:

The conditional variance-covariance matrix of error terms of the model (Image by Author)

Since the mean of the error terms is assumed to be zero (a central assumption of the linear model), the elements E[(ϵ_i*ϵ_i)|X] along the main diagonal of the above matrix (the one that extends from the top-left to the bottom-right) are the conditional variances of the error terms ϵ_i, and all non-diagonal elements E[(ϵ_i*ϵ_j)|X] are the conditional covariances of the error terms ϵ_i and ϵ_j.

Thus, E[(ϵϵ’)|X] is the variance-covariance matrix; in short, the covariance matrix of the error term.

In statistical literature, the covariance matrix of errors is often represented as follows:

The covariance matrix of regression errors (Image by Author)

σ² is a constant scaling factor that we extract out of the matrix such that ω_ij = ρ_ij/σ². The resulting [n x n] matrix of “omegas” is denoted by the uppercase Greek letter Ω.

Therefore,

E[(ϵϵ’)|X] = σ²Ω

The main diagonal elements of Ω contain (scaled) variances of errors and all other elements of Ω contain the (scaled) covariances of errors.

If the model’s errors are homoskedastic (constant variance), all the main diagonal elements of Ω are 1, i.e., ω_ii = 1 for all i.

If the errors are uncorrelated, all non-diagonal elements of Ω are 0.

Hence, for homoskedastic, uncorrelated errors, the covariance matrix takes the following form:

The covariance matrix of the regression model’s errors when the errors are homoskedastic and non-auto-correlated (Image by Author)

In Eq (5), I is the identity matrix of size [n x n]. Substituting Eq (5) in Eq (4) and simplifying a bit, we get this nice little result:

The formula for the covariance matrix of the fitted regression coefficients when the model’s errors are homoskedastic and non-auto-correlated (Image by Author)
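In plain notation, substituting σ²I into the center of Eq (4) gives:

Var(β_cap|X) = (X'X)^(-1) X' (σ²I) X (X'X)^(-1) = σ² (X'X)^(-1)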

But if the model’s errors are heteroskedastic and/or correlated, the covariance matrix of errors is no longer σ²I, and the above formula for variance yields incorrect results leading to all the problems that we went over at the beginning of the article.

The Generalized Least Squares approach

A direct approach to fixing this issue is GLS. We start by defining a square matrix C of size [n x n] and a diagonal matrix D of size [n x n] such that the covariance matrix Ω can be expressed as a product of C, D and the transpose of C as follows:

Ω=CDC’ (Image by Author)

Under certain conditions (which we won’t go into here except to say that C is what’s known as an orthogonal matrix), it is always possible to find two such matrices C and D. Incidentally, a diagonal matrix is one in which all elements that are not along the main diagonal are zero. Here’s what D looks like:

The D matrix (Image by Author)

The main-diagonal elements d_ii may or may not all have the same value. The Identity matrix I is an example of a diagonal matrix in which all diagonal elements are 1.

Next, we define a matrix G such that its transpose is the following multiplication of matrices C and D:

The matrix G (Image by Author)

The exponentiated version of D deserves an explanation. If the diagonal elements of D are d_ii, then D raised to the power (-1/2) is essentially the inverse of the “square root” of the matrix, and it contains diagonal elements 1/√d_ii, as follows:

The diagonal matrix D, and the inverse “square root” of D (Image by Author)
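Putting Eq (7) and Eq (8) together in plain notation:

Ω = CDC'
G' = C D^(-1/2), and therefore G = (C D^(-1/2))' = D^(-1/2) C'

(The transpose of a product reverses the order of its factors, and D^(-1/2), being diagonal, equals its own transpose.)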

It should soon be clear why we are doing these mysterious transformations.

Let’s recall the equation of the linear model which we stated all the way at the beginning of the article, in Eq (1):

A linear model (Image by Author)

We’ll left-multiply both sides of Eq (1) by G as follows:

A transformed linear model (Image by Author)

Let’s convince ourselves that the model in Eq (9) is still a linear model. G is a square matrix of size [n x n]. Since y is a column matrix of size [n x 1], Gy is simply a scaled version of y of size [n x 1]. Similarly, X is of size [n x k] and so (GX) is a scaled version of X of size [n x k]. And (Gϵ) is a scaled version of ϵ of size [n x 1]. Thus, Eq (9) is a regression of a scaled (some might say, transformed) version of y on a correspondingly scaled (transformed) version of X. The coefficients β of the original linear model will work just as well for the scaled model of Eq (9).

For convenience, we’ll substitute Gy with y*, GX with X*, and Gϵ with ϵ*:

The transformed (scaled) linear model (Image by Author)
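In plain notation, Eq (9) and Eq (10) read:

Gy = GXβ + Gϵ
y* = X*β + ϵ*, where y* = Gy, X* = GX and ϵ* = Gϵ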

Since Eq (10) is a valid linear model, all results that are applicable to linear models hold true for it. To start with, the variance of the error term ϵ* can be stated as follows:

The variance of the error term ϵ* of the scaled linear model (Image by Author)

In Eq (11), we have replaced ϵ* with Gϵ on the R.H.S. We simplify the R.H.S. of the above equation as follows:

Working through the formula for the covariance matrix of the scaled model’s errors (Image by Author)

We’ll now work on the yellow bit in Eq (12). For that, we’ll use equations (7) and (8) for Ω and G which we’ll reproduce below:

Ω=CDC’ (Image by Author)
The matrix G (Image by Author)

We’ll use these substitutions in Eq (12) as follows:

(Image by Author)

In Eq (13), we’ve used the identity (G’)’ = G, i.e., the transpose of a transpose gives us back the original matrix.

Let’s simplify the R.H.S. of Eq (13):

Simplification of the term GΩG’ (Image by Author)

We noted earlier that C is an orthogonal matrix. A property of an orthogonal matrix is that its transpose equals its inverse:

Transpose of an orthogonal matrix is the same as its inverse (Image by Author)

This implies that the product of C’ and C is the same as the product of C and its inverse. The product of C and its inverse is the identity matrix I. This result is the matrix equivalent of N times (1/N) equals 1.

This implies C’C = I. Let’s continue with our simplification of GΩG’:

Simplification of the term GΩG’ (Image by Author)

To further simplify the R.H.S. of Eq (14), we must recollect the nature of the diagonal matrix D, and D raised to (-1/2) power:

The diagonal matrix D, and the inverse “square root” of D (Image by Author)

Furthermore, the transpose of a diagonal matrix is the same matrix since the transpose operation simply flips a matrix around its main diagonal and in a diagonal matrix all non-diagonal elements are 0.

With this setup understood, it can be shown that the following product equates to an identity matrix of size [n x n]:

(Image by Author)

Plugging this result into Eq (14), we get:

GΩG’ resolves to the [n x n] identity matrix (Image by Author)
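Stringing the steps together in plain notation:

GΩG' = (D^(-1/2) C') (CDC') (C D^(-1/2))
     = D^(-1/2) (C'C) D (C'C) D^(-1/2)
     = D^(-1/2) D D^(-1/2)
     = I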

Plugging Eq (15) into Eq (12) and then into Eq (11), we get the formula for the covariance matrix of errors for the scaled regression model y* = X*β + ϵ* where y* = Gy, X* = GX, and ϵ* = Gϵ. We summarize this result below:

The covariance matrix of errors of the scaled linear model y* = X*β + ϵ*, i.e., Gy = GXβ + Gϵ (Image by Author)

Eq (16) is an important result in the development of the GLS technique. It states that the errors of the scaled linear model are homoskedastic (i.e., of constant variance) and uncorrelated. Therefore, a least squares estimator of β for this linear model will necessarily be efficient, i.e., of the lowest possible variance (in addition to being consistent and unbiased).

This leads us to an important result:

Even if the data exhibits heteroskedasticity and/or auto-correlation, the scaled (transformed) linear regression model that we have developed can be fitted using a least squares estimator that is efficient, consistent and unbiased; in other words, it is the Best Linear Unbiased Estimator for this model.

How do we develop such a least squares estimator for the scaled model? We do it as follows:

Recollect Equation (2) which states that the least squares estimator of β for the linear model y = Xβ + ϵ is given by the following formula:

The vector of estimated coefficients, estimated using ordinary least squares (Image by Author)

In Eq (2), if we substitute y with y* and X with X*, we get the following estimator for the scaled linear model:

The GLS estimator for β (Image by Author)

To simplify Eq (17), we substitute y* = Gy, X* = GX, and ϵ* = Gϵ. We also make use of the identity (GX)’ = X’G’, and of the result that the inverse of the Ω matrix is G’G; this latter result follows from Equations (7) and (8). These substitutions are shown below:

The GLS estimator for β (Image by Author)
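In plain notation, writing β_cap_gls for the GLS estimate of β, the chain of substitutions that takes us from Eq (17) to Eq (18) is:

β_cap_gls = (X*'X*)^(-1) X*'y*
          = ((GX)'(GX))^(-1) (GX)'(Gy)
          = (X'G'GX)^(-1) X'G'Gy
          = (X'Ω^(-1)X)^(-1) X'Ω^(-1)y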

Eq (18) is the GLS estimator (also known as Aitken’s Generalized Least Squares estimator). It is an efficient, consistent, unbiased estimator whether or not the data set exhibits heteroskedasticity and/or auto-correlation. In other words, it is guaranteed to be the Best Linear Unbiased Estimator of the coefficients vector β.

There is only one problem with it.

It relies on us knowing the covariance matrix of errors Ω. But Ω is intrinsically unobservable as it contains the covariance of the model’s error term which cannot be directly observed by the experimenter.

The solution is to create a model of Ω and estimate it, and then use that estimated Ω in the GLS estimator to estimate β. This strategy of using an estimated version of Ω is sometimes called the Feasible Generalized Least Squares (FGLS) technique.

There are several strategies for estimating Ω. One such technique models Ω as follows:

A modeled Ω (Image by Author)

In the above model, ω_ii is the variance of the ith error ϵ_i corresponding to the ith row in the data set. We estimate the n variances ω_ii by regressing the residuals on the predicted values of y from an OLS model fitted on the data set (we’ll see how to do this next week in PART II of this article).

ρ is the correlation between the ith and the (i+1)th error term, namely ϵ_i and ϵ_(i+1). We estimate ρ via the auto-correlation plot of the residuals of an OLS model fitted on the data set.

In the above matrix, we have assumed that the correlations between error terms decay as successive powers of ρ (i.e., ρ, ρ², ρ³, and so on) as the separation between the corresponding data set rows increases.
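Purely as a preview of what such a feasible GLS fit can look like in statsmodels, here is a rough sketch, assuming the olsr_results object from the earlier snippets. The way the diagonal variances are modeled below (a regression of the squared residuals on the predicted values) and the way Ω is assembled are one plausible set of choices and may differ from the approach taken in PART II:

import numpy as np
import statsmodels.api as sm

resid = np.asarray(olsr_results.resid)
y_hat = np.asarray(olsr_results.fittedvalues)
n = len(resid)

# Step 1: model the error variances omega_ii by regressing the squared
# residuals on the predicted values, clipping to keep them positive.
aux_results = sm.OLS(resid**2, sm.add_constant(y_hat)).fit()
omega_ii = np.clip(aux_results.fittedvalues, 1e-8, None)

# Step 2: estimate the lag-1 correlation rho from the residuals.
rho = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# Step 3: assemble Omega, letting the correlation between rows i and j
# decay as rho raised to the power |i - j|.
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
sd = np.sqrt(omega_ii)
Omega = np.outer(sd, sd) * (rho ** lags)

# Step 4: hand the estimated Omega to the GLS estimator.
gls_results = sm.GLS(olsr_results.model.endog,
                     olsr_results.model.exog, sigma=Omega).fit()
print(gls_results.summary())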

In PART II of this article (due out next week) we’ll work through a tutorial on how to use the Generalized Least Squares estimator to fit a linear model on the ACS data set for estimating county-level poverty in the US.

Stay tuned!

References, Citations and Copyrights

Data set

The American Community Survey data set used in this article can be downloaded from here. The complete ACS data set can be pulled from the US Census Bureau’s website using publicly available APIs (see Terms of Service and this link), or directly from the Census Bureau’s Community Explorer website.

Paper

Aitken, A. (1936). IV. — On Least Squares and Linear Combination of Observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48. doi:10.1017/S0370164600014346

Images

All images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression, time series analysis, and forecasting.
