
Controlling for “X”?

Understanding linear regression mechanics via the Frisch-Waugh-Lovell Theorem

Introduction

Applied econometrics is generally interested in establishing causality: that is, the "treatment" effect of T on some outcome y. In a simple bivariate case, we can imagine randomly assigning treatment T=1 to some individuals and T=0 to others. This can be represented by the following linear regression model:

$$y_i = \alpha + \tau T_i + \epsilon_i \tag{1}$$

If we assume the treatment is truly randomly assigned, then T is independent of the error term or, in the economists' jargon, exogenous. Therefore, we can estimate eq. (1) using ordinary least squares (OLS) and give the coefficient estimate on T a causal interpretation – the average treatment effect (ATE):

$$\hat{\tau} = \mathbb{E}[\,y_i \mid T_i = 1\,] - \mathbb{E}[\,y_i \mid T_i = 0\,] = \text{ATE} \tag{2}$$
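As a quick sanity check, here is a minimal simulation sketch of this idea (the binary assignment, the true ATE of 2, and the noise are all made up for illustration): with random assignment, the OLS coefficient on T and the simple difference in means coincide.

## Illustrative sketch: OLS on a randomly assigned treatment recovers the ATE
# (hypothetical DGP: true ATE = 2, standard normal noise)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
sim = pd.DataFrame({'T': rng.integers(0, 2, size=5_000)})
sim['y'] = 2*sim['T'] + rng.normal(size=5_000)

ate_ols = smf.ols('y ~ T', data=sim).fit().params['T']
ate_diff = sim.loc[sim['T']==1, 'y'].mean() - sim.loc[sim['T']==0, 'y'].mean()
print(ate_ols, ate_diff)  # both approximately 2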

However, when we are dealing with non-experimental data, it is almost always the case that the treatment of interest is not independent of the error term or, again in the economists' jargon, endogenous. For example, suppose we were interested in identifying the treatment effect of time spent reading books as a child on an individual's future educational attainment. Without any random assignment of time spent reading as a child, estimating a naïve linear regression, as in eq. (1), will fail to account for a large set of additional factors that may drive both time spent reading books and educational attainment (e.g., socioeconomic status, parents' education, underlying ability, other hobbies, etc.). Thus, when dealing with non-experimental data, we must rely on controlling for additional covariates and then argue that treatment is "as good as randomly assigned" to establish causality. This is known as the conditional independence assumption (CIA). In our education example above, we can reframe eq. (1) as:

$$Education_i = \alpha + \tau\, Read_i + X_i'\beta + \epsilon_i \tag{3}$$

where we now control for a set of observed covariates X. The key estimate of interest on Read takes on a causal interpretation if and only if the CIA holds. That is, time spent reading is exogenous (i.e., no uncontrolled confounders) conditional on X. Equivalently,

$$\epsilon_i \perp\!\!\!\perp Read_i \mid X_i \tag{4}$$

Without the CIA, our coefficient estimates are biased and we are limited in what we can say in terms of causality. Realistically, it is often quite difficult to make the argument for the CIA and, unfortunately, this assumption is not directly testable. In fact, what I have discussed above is a fundamental motivator for an entire field of econometrics devoted to developing and implementing quasi-experimental research designs for establishing causality, including, but most definitely not limited to, difference-in-differences, synthetic control, and instrumental variable designs. These quasi-experimental designs seek to exploit exogenous ("as good as random") sources of variation in a treatment of interest T to study the causal effect of T on some outcome(s) y. There are some excellent econometric texts that are accessible to those with little to no background in econometrics, including "The Effect" by Nick Huntington-Klein, "Causal Inference: The Mixtape" by Scott Cunningham, and "Causal Inference for the Brave and True" by Matheus Facure Alves.[1][2][3] Joshua Angrist and Jörn-Steffen Pischke provide a deeper dive in "Mostly Harmless Econometrics" for those interested.[4]

Although satisfying the CIA through controlling for covariates alone is particularly difficult, there is a substantial theorem in econometrics that provides some very powerful intuition into what it really means to "control" for additional covariates. Ultimately, this not only provides a deeper understanding of the underlying mechanisms of a linear regression, but also of how to conceptualize key relationships of interest (i.e., the effect of T on y).

Note that I have (intentionally) glossed over some additional causal inference/econometric assumptions, such as positivity/common support and SUTVA/counterfactual consistency. In general, the CIA/ignorability assumption is the most common assumption that needs to be defended. However, the interested reader is encouraged to familiarize themselves with these additional assumptions. In brief, positivity ensures we have untreated individuals who are similar and comparable to treated individuals, enabling counterfactual estimation, and SUTVA ensures there are no spillover/network-type effects (where treatment of one individual impacts another).

Frisch-Waugh-Lovell Theorem

In the 1930s, econometricians Ragnar Frisch and Frederick V. Waugh developed a ~super cool~ theorem (the FWL theorem), later generalized by Michael C. Lovell, that allows the key parameter(s) in a linear regression to be estimated by first "partialling out" the effects of the additional covariates.[5][6] First, a quick refresher on linear regression will be helpful.

A linear regression solves for the best linear predictor of an outcome y given a set of independent variables X, where the fitted values of y are projected onto the space spanned by X. In matrix notation, the linear regression model we are interested in is characterized by:

$$y = X\beta + \epsilon \tag{5}$$

The goal of linear regression is to minimize the residual sum of squares (RSS); thus, the estimator can be obtained by solving the following optimization problem:

$$\hat{\beta} = \underset{\beta}{\arg\min}\; (y - X\beta)'(y - X\beta) \tag{6}$$

Taking the derivative with respect to β and setting it equal to zero, the optimal solution to (6) is:

$$\hat{\beta} = (X'X)^{-1}X'y \tag{7}$$

This is the ordinary least squares (OLS) estimator that is the workhorse behind the scenes when we run a linear regression to obtain the parameter estimates. Now with that refresher out of the way, let’s get to what makes the FWL so great.
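To make the mechanics concrete, here is a minimal NumPy sketch of eq. (7) on made-up data (the design matrix and true coefficients below are purely hypothetical); np.linalg.lstsq is included as a numerically more stable alternative to explicitly inverting X'X.

## Illustrative sketch: the closed-form OLS estimator from eq. (7)
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 0.5, -0.3])                      # hypothetical true coefficients
y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y                 # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)          # more stable in practice
print(beta_hat, beta_lstsq)                                 # both close to beta_true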

Let’s return to our example of estimating the educational returns to reading as a child. Suppose we only want to obtain the key parameter of interest in eq. (3); that is, the effect of days per month spent reading as a child on educational attainment. Recall that in order to make a causal statement about our estimate, we must satisfy the CIA. Thus, we can control for a set of additional covariates X and then estimate (3) directly using the OLS estimator derived in (7). However, the FWL Theorem allows us to obtain the exact same key parameter estimate on Read under the following 3-step procedure:

  1. Regress Read onto the set of covariates X only and, similarly, regress Education onto the set of covariates X only:

$$Read_i = X_i'\gamma + u_i \tag{8}$$
$$Education_i = X_i'\delta + v_i \tag{9}$$

  2. Store the residuals from estimating (8) and (9), denoted Read* and Education*:

$$Read_i^* = Read_i - X_i'\hat{\gamma} \tag{10}$$
$$Education_i^* = Education_i - X_i'\hat{\delta} \tag{11}$$

  3. Regress Education* onto Read*:

$$Education_i^* = \tau\, Read_i^* + \epsilon_i^* \tag{12}$$

And that’s it!

Intuitively, the FWL theorem partials out the variation in Read (the treatment/variable of interest) and Education (the outcome of interest) that is explained by the additional covariates, and then uses the remaining variation to estimate the key relationship of interest. This procedure can be generalized for any number of key variables of interest. For a more formal proof of this theorem, refer to [7]. The FWL theorem has also been in the spotlight recently as the theoretical underpinning for debiased/orthogonal machine learning, where steps 1 and 2 are conducted using machine learning algorithms rather than OLS. There are very cool developments occurring that are bridging the gaps between econometrics and machine learning, and I hope to have future posts with applications of some of these new methods. Part 2 of Matheus Facure Alves' "Causal Inference for the Brave and True" is a great place to start.
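For readers who like the matrix-algebra view, here is a minimal sketch of the partialling-out logic on made-up data, using the residual-maker (annihilator) matrix M = I − X(X'X)⁻¹X'; the slope on the residualized treatment matches the coefficient from the full regression.

## Illustrative sketch: FWL partialling out via the residual-maker matrix
# (hypothetical data; not the simulation used later in the post)
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
controls = rng.normal(size=(n, 2))                         # additional covariates
X = np.column_stack([np.ones(n), controls])                # controls incl. intercept
t = 0.5*controls[:, 0] + rng.normal(size=n)                # treatment, confounded by controls
y = 2.0*t + controls @ np.array([1.0, -1.0]) + rng.normal(size=n)

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T           # residual-maker matrix
t_star, y_star = M @ t, M @ y                              # partialled-out treatment & outcome
tau_fwl = (t_star @ y_star) / (t_star @ t_star)            # bivariate slope on the residuals

full = np.column_stack([t, X])                             # full regression of y on [t, X]
tau_full = (np.linalg.inv(full.T @ full) @ full.T @ y)[0]
print(tau_fwl, tau_full)                                   # identical up to floating-point error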

Now you may wonder why in the world you would ever go through this process to obtain the exact same key estimate. For one, it provides an immense amount of intuition about the mechanics of a linear regression. Secondly, it allows you to visualize the remaining variation in your treatment (Read) that is being used to explain the remaining variation in your outcome (Education). Let's look at this in action!

FWL Theorem Application

In this section, we are going to simulate a highly stylized dataset to provide a simplified numerical example of applying the FWL theorem in answering our empirical question of the educational returns to childhood reading.

Suppose we hypothesize a set of demographic variables that we determine to be the relevant confounders necessary to satisfy the CIA in eq. (3), thus allowing us to obtain a causal interpretation for the educational returns to childhood reading. Namely, suppose we identify the key confounders to be the average education level of both parents in years (pareduc), household income as a child in tens of thousands of dollars (HHinc), and IQ score (IQ). We artificially generate the confounders according to the following data generating process (DGP):

$$pareduc_i \sim \text{round}\big(\mathcal{N}(14,\, 3^2)\big), \quad HHinc_i \sim \text{round}\big(\text{SkewNormal}(a=5,\, loc=3,\, scale=4)\big), \quad IQ_i \sim \text{round}\big(\mathcal{N}(100,\, 10^2)\big) \tag{13}$$

Furthermore, to estimate eq. (3) we must have measures for the key treatment, the average number of days per month an individual read as a child (read), and the main outcome, their total educational attainment in years (educ). We artificially generate these key variables with Gaussian error terms and heteroskedasticity in the education error term as follows:

$$read_i = \text{round}\big(-25 + 0.3\, pareduc_i + 2\, HHinc_i + 0.2\, IQ_i + u_i\big), \quad u_i \sim \mathcal{N}(0,\, 2^2)$$
$$educ_i = -15 + 0.2\, read_i + 0.1\, pareduc_i + 1\, HHinc_i + 0.2\, IQ_i + \tfrac{read_i}{15}\, v_i, \quad v_i \sim \mathcal{N}(0,\, 2^2) \tag{14}$$

Because we know the true DGP, the true value of the parameter of interest is 0.2. Let's take this DGP to Python and simulate the data:


Note that all values in the DGP were, in general, chosen arbitrarily so that the data works nicely for demonstration purposes. However, within the realm of this simulation, we can interpret the coefficient on "read" as follows: on average, each additional day per month that an individual read as a child increased their educational attainment by 0.2 years.



## Import Relevant Packages
import pandas as pd
import numpy as np
from scipy.stats import skewnorm
import seaborn as sns
import matplotlib.pyplot as plt

## Data Generating Process
df = pd.DataFrame()
n = 10000

# Covariates
df['pareduc'] = np.random.normal(loc=14,scale=3,size=n).round()
df['HHinc'] = skewnorm.rvs(5,loc=3,scale=4,size=n).round()
df['IQ'] = np.random.normal(loc=100,scale=10,size=n).round()

# Childhood Monthly Reading
df['read'] = (-25+
              0.3*df['pareduc']+
              2*df['HHinc']+
              0.2*df['IQ']+
              np.random.normal(0,2,size=n)).round()

df = df[(df['read']>0) & (df['read']<31)] # Drop unrealistic outliers

# Education Attainment
df['educ'] = (-15+
              0.2*df['read']+
              0.1*df['pareduc']+
              1*df['HHinc']+
              0.2*df['IQ']+
              df['read']/15*np.random.normal(0,2,size=len(df)).round())

## Plot Simulated Data
fig, ax = plt.subplots(3,2,figsize=(12,12))
sns.histplot(df.HHinc,color='b',ax=ax[0,0],bins=15,stat='proportion',kde=True)
sns.histplot(df.IQ,color='m',ax=ax[0,1],bins=20,stat='proportion',kde=True)
sns.histplot(df.pareduc,color='black',ax=ax[1,0],bins=20,stat='proportion',kde=True)
sns.histplot(df.read,color='r',ax=ax[1,1],bins=30,stat='proportion',kde=True)
sns.histplot(df.educ,color='g',ax=ax[2,0],bins=30,stat='proportion',kde=True)
sns.regplot(data=df,x='read',y='educ',color='y',truncate=False,ax=ax[2,1])
plt.show()

The data will look a little something like:

Figure 1: Distributions of the simulated variables (HHinc, IQ, pareduc, read, educ) and the naïve regression of educ on read.

The graph in the bottom right provides the scatter plot and naïve regression line of educ on read. On the surface, this shows a very strong positive relationship between days read per month as a child and educational attainment. However, we know by construction that this is not the true relationship between educ and read, because of the common confounding covariates. We can quantify this result and the bias more formally via regression analysis. Let's now go ahead and estimate the naïve regression (i.e., eq. (3) without X), the multiple regression with all relevant covariates (i.e., eq. (3)), and the FWL 3-step process (i.e., eqs. (8)-(12)):

import statsmodels.formula.api as sm

## Regression Analysis

# Naive Regression
naive = sm.ols('educ~read',data=df).fit(cov_type="HC3")

# Multiple Regression
multiple = sm.ols('educ~read+pareduc+HHinc+IQ',data=df).fit(cov_type='HC3')

# FWL Theorem
read = sm.ols('read~pareduc+HHinc+IQ',data=df).fit(cov_type='HC3')
df['read_star']=read.resid

educ = sm.ols('educ~pareduc+HHinc+IQ',data=df).fit(cov_type='HC3')
df['educ_star']=educ.resid

FWL = sm.ols("educ_star ~ read_star",data=df).fit(cov_type='HC3')
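
# Quick sanity check: the FWL coefficient on read_star should match (up to
# floating-point error) the coefficient on read from the multiple regression above
print(multiple.params['read'], FWL.params['read_star'])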

## Save Nice Looking Table
from stargazer.stargazer import Stargazer

file = open('table.html','w')

order = ['read','read_star','HHinc','pareduc','IQ','Intercept']
columns = ['Naive OLS','Multiple OLS','FWL']
rename = {'read':'Read (Days/Month)','read_star':'Read*','HHinc':'HH Income',
          'pareduc':"Avg. Parents' Education (Yrs)"}

regtable = Stargazer([naive, multiple, FWL])
regtable.covariate_order(order)
regtable.custom_columns(columns,[1,1,1])
regtable.rename_covariates(rename)
regtable.show_degrees_of_freedom(False)
regtable.title('The Simulated Effect of Childhood Reading on Educational Attainment')

file.write(regtable.render_html())
file.close()

The regression results are:

Table 1: The Simulated Effect of Childhood Reading on Educational Attainment (Naive OLS, Multiple OLS, and FWL estimates).

Table 1 above presents the regression output. Immediately, we can observe that the naïve regression estimate on read is biased upwards due to the confounding variables, which are positively related to both educational attainment and childhood reading. When we include the additional covariates in column (2), we get an estimate near the true value of 0.2 as constructed in the DGP. The FWL 3-step process yields the exact same estimate, as expected!

A general rule of thumb for signing the bias in a regression is the sign of cov(outcome, X) multiplied by the sign of cov(treatment, X). By construction, we have cov(educ, X) > 0 and cov(read, X) > 0 and, hence, positive bias.
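More formally, for the textbook case of a single omitted covariate X with true outcome coefficient β_X, the classic omitted variable bias formula makes this rule of thumb explicit:

$$\text{plim}\; \hat{\tau}_{naive} = \tau + \beta_X\, \frac{\text{cov}(Read,\, X)}{\text{var}(Read)}$$

so the direction of the bias is the sign of β_X times the sign of cov(Read, X), consistent with the rule of thumb above.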

So far we have shown the FWL theorem being used to obtain the same estimate, but its real power lies in the ability to plot the relationship of interest. Figure 2 below shows the naïve regression, which does not factor in the covariates, alongside the regression of the residuals from the FWL process, where the remaining noise comes from the stochastic error term in the DGP. In this simulation, the FWL slope is the true relationship, and we can see how vastly different the two slope estimates are. This is where the true power of the FWL theorem lies: it allows us to visualize the relationship between a treatment and an outcome after we partial out the variation that is already explained by the additional covariates.
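Here is a minimal sketch of how a figure like Figure 2 could be produced, reusing the earlier imports and the read_star/educ_star residual columns (the side-by-side layout and styling are my own assumptions):

## Illustrative sketch: naive vs. partialled-out (FWL) relationship
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.regplot(data=df, x='read', y='educ', color='y',
            scatter_kws={'alpha': 0.3}, ax=ax[0])
ax[0].set_title('Naive: educ on read')
sns.regplot(data=df, x='read_star', y='educ_star', color='g',
            scatter_kws={'alpha': 0.3}, ax=ax[1])
ax[1].set_title('FWL: educ* on read*')
plt.show()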

Figure 2: The naïve regression of educ on read versus the FWL regression of the residualized educ* on read*.

Discussion

We have discussed the Frisch-Waugh-Lovell theorem in depth and have provided an intuitive approach to understanding what it means to "control" for covariates in a regression model when one is interested in a treatment parameter. It is a powerful theorem and has provided a strong underpinning for many econometric results developed over the years.

FWL provides a powerful mechanism by which to visualize the relationship between an outcome and treatment after one partials out the effects from additional covariates. In fact, FWL can be used to study the relationship between any two variables and the role covariates play in explaining their underlying relationship. I recommend trying it out on a dataset where you are interested in the relationship between two variables, and the role of additional covariates in confounding that relationship!

I hope you have gained some new knowledge from this post!

References

[1] N. Huntington-Klein, The Effect: An Introduction to Research Design and Causality (2022).

[2] S. Cunningham, Causal Inference: The Mixtape (2021).

[3] M. F. Alves, Causal Inference for the Brave and True (2021).

[4] J. Angrist and J.-S. Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion (2009), Princeton University Press.

[5] R. Frisch and F. V. Waugh, Partial Time Regressions as Compared with Individual Trends (1933), Econometrica: Journal of the Econometric Society, 387–401.

[6] M. C. Lovell, Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis (1963), Journal of the American Statistical Association 58(304): 993–1010.

[7] M. C. Lovell, A Simple Proof of the FWL Theorem (2008), Journal of Economic Education 39(1): 88–91.


Access all the code via this GitHub Repo: https://github.com/jakepenzak/Blog-Posts

I appreciate you reading my post! My posts on Medium seek to explore real-world and theoretical applications utilizing econometric and statistical/machine learning techniques. Additionally, I seek to provide posts on the theoretical underpinnings of certain methodologies via Theory and simulations. Most importantly, I write to learn! I hope to make complex topics slightly more accessible to all. If you enjoyed this post, please consider following me on Medium!

