Instrumental Variables: A practical explanation

Dan Baumann
Towards Data Science
4 min read · Jan 14, 2020


Introduction

In statistical analysis, we are quick to overlook certain problems that come with predictor variables: our independent variables often possess underlying properties that directly affect the validity of model results.

In regression analysis, we are tasked with estimating causal relationships between our independent and dependent variables, and we assume that this causal relationship is consistent across an experiment. When we believe this assumption is violated, we can use instrumental variables (IV) to correctly estimate the ‘treatment’ effect of a given independent variable. Instrumental variables thus recover true effects rather than biased ones.

In this blog, I will demonstrate why IV estimation adds value to statistical frameworks, using an example of returns to education. It is also important to note that IVs are not always necessary, but this blog will hopefully inform you as to when they may be appropriate.

Why use IVs?

Sometimes in regression analysis, we overlook factors that are intrinsically linked to the independent variables in question. Regression analysis serves the purpose of finding the causal effect, ceteris paribus, of an independent variable on our dependent variable. In reality, however, increasing said independent variable often produces a change different from what our model would predict.

Joshua Angrist’s seminal work on estimating wage returns to military participation in the Vietnam War showed some enlightening results as to how, under the hood, an independent variable such as military participation is not wholly informative. Angrist found that while many individuals volunteered, many men were also conscripted through the draft lottery, and the two had different impacts on wage returns.

So, Angrist used draft eligibility (conscription) as an instrumental variable: we expect it to be correlated with military participation but not with our error term, meaning it affects wages only through participation. Through this, different estimates were obtained that better reflected the true effect of military participation.

Interpreting OLS coefficients

Suppose we observe the following regression obtained using Ordinary Least Squares (OLS):

y = α + βX + ε

In a model prediction, we would infer the following effect on y if we were to increase X by one unit:

Δy = β × 1

The case for instrumental variables arises when the above is not the true effect of a one-unit increase in X. The point I am trying to convey is that an independent variable may be correlated with the error term epsilon.

Change in y when X is correlated with the error term:

Δy = β × 1 + δε

Here δ captures the part of the error that moves together with X: because X and ε are correlated, a change in X drags a change in ε along with it. OLS cannot separate the two, attributes the whole movement to β, and so produces a biased estimate.

For OLS estimation to be unbiased, explanatory variables must not be correlated with the error term. Variables satisfying this condition are said to be exogenous: they carry no information about the error. The case above, where an independent variable is correlated with the error, therefore violates this OLS assumption and must be addressed in some way. Variables that behave in this way are called endogenous variables.
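To make this bias concrete, here is a minimal simulation sketch in Python (the variable names, the endogeneity strength of 0.8, and the use of numpy/statsmodels are my illustrative choices): we build an X that shares a confounder with the error term and watch OLS overshoot the true β = 2.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

# A shared confounder makes x endogenous: it moves both x and the error.
confounder = rng.normal(size=n)
x = 0.8 * confounder + rng.normal(size=n)
eps = 0.8 * confounder + rng.normal(size=n)

y = 1.0 + 2.0 * x + eps  # true alpha = 1, true beta = 2

ols = sm.OLS(y, sm.add_constant(x)).fit()
print(ols.params)  # the slope lands near 2.4, not 2, because cov(x, eps) > 0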

Hello instrumental variables

IV Assumptions and Methodology

Following on from the explanation of why we may want to use instrumental variables, we need an instrument, Z, that satisfies the following assumptions:

  • Relevance: Z can predict X, i.e. cov(Z, X) ≠ 0
  • Exogeneity: Z is uncorrelated with the error term i.e. cov(Z, ε) = 0

Relevance is important because it states that our chosen instrument is correlated with the independent variable we care about.

Exogeneity is important because it states that our instrument is uncorrelated with the error term.

In words, these assumptions mean that the instrument must affect y ONLY through X and that it must have some effect on X.
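In practice, relevance can be checked directly from the data by regressing X on Z and inspecting the first-stage F-statistic (a common rule of thumb is an F above 10); exogeneity, by contrast, cannot be tested from the data and has to be argued on substantive grounds. A small sketch, where the names and coefficients are again my illustrative choices:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000

z = rng.normal(size=n)            # candidate instrument
x = 0.5 * z + rng.normal(size=n)  # z genuinely predicts x

first_stage = sm.OLS(x, sm.add_constant(z)).fit()
print(first_stage.fvalue)  # far above 10 here, so relevance looks satisfied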

Figure 1: How IV works

IV Application: Returns to schooling

We may come across the following regression when trying to ascertain the returns to schooling:

log(wages) = α + βEdu + ε

  • Where log(wages) is the outcome variable we are trying to predict
  • α is some constant
  • Edu is an independent variable that captures years of education as a continuous variable
  • ε is the error
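Because the outcome is in logs, β has an approximate percentage interpretation: if, say, β were 0.08 (a made-up value, purely for illustration), each additional year of education would be associated with roughly an 8% higher wage.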

We may believe Edu to be somewhat correlated with our error term. This problem may arise for several reasons:

  • Family background may affect years of schooling
  • Ability, which could (arguably) be measured by IQ, may affect years of schooling

Now, we can use one of these as an instrument for our independent variable. We must ensure it satisfies our two assumptions, relevance and exogeneity: it must be able to predict Edu while being uncorrelated with the error term.

The Next Step: 2SLS

We use Two-Stage Least Squares (2SLS) to estimate a new Edu variable, which we then substitute into our initial regression.

Figure 2: Two-Stage Least Squares

Firstly, we regress education on our instrumental variable, family background, to compute fitted values for education.

The second step is to substitute these fitted values into our initial wage regression.
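Here is a minimal end-to-end sketch of the two stages in Python. The data-generating process, the stand-in ‘family’ instrument, and all coefficient values are my illustrative assumptions, not from the original analysis; the point is only to show OLS and 2SLS diverging when Edu is endogenous.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

family = rng.normal(size=n)   # hypothetical family-background instrument
ability = rng.normal(size=n)  # unobserved confounder
edu = 12 + 1.0 * family + 0.8 * ability + rng.normal(size=n)
log_wage = 1.0 + 0.10 * edu + 0.5 * ability + rng.normal(scale=0.5, size=n)

# Naive OLS: biased upward, since ability drives both edu and wages.
ols = sm.OLS(log_wage, sm.add_constant(edu)).fit()

# Stage 1: regress the endogenous regressor on the instrument, keep fitted values.
stage1 = sm.OLS(edu, sm.add_constant(family)).fit()
edu_hat = stage1.fittedvalues

# Stage 2: substitute the fitted values into the wage regression.
stage2 = sm.OLS(log_wage, sm.add_constant(edu_hat)).fit()

print("true beta:  0.10")
print("OLS  beta: ", round(ols.params[1], 3))     # biased, around 0.25
print("2SLS beta: ", round(stage2.params[1], 3))  # close to 0.10

One caveat worth knowing: when 2SLS is run manually like this, the standard errors reported by the second-stage regression are not valid. In practice a dedicated routine, such as IV2SLS from the linearmodels package, computes both the point estimate and correct standard errors in one call.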

Results

Now, we hopefully obtain an unbiased estimate of the effect of schooling on wages. IV estimates are preferable to OLS estimates when we believe our predictors violate exogeneity, because they remain consistent where OLS is biased; note, however, that IV estimates are typically less precise than OLS, so the gain is in validity rather than efficiency.

Conclusion

I hope that this blog post has informed you as to the intuition behind using instrumental variables in statistical frameworks. While this is a rather simplistic example, there are often opportunities to use several instrumental variables at once. Also, users must be wary that instrumental variables do not always improve the validity or robustness of models.

References

Joshua D. Angrist, “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records,” American Economic Review, June 1990. https://www.jstor.org/stable/2006669?seq=1
