Hands-on Tutorials

Causal ML for Data Science: Deep Learning with Instrumental Variables

Combining data science and econometrics for an introduction to the DeepIV framework, including a full Python code tutorial.

Haaya Naushan
Towards Data Science
19 min read · May 13, 2021


Map of the West Bank, Palestine, showing small peripheral neighbourhoods in red and larger more central neighbourhoods in blue. Image by author.

Historically, both economists and philosophers have been preoccupied with extracting an understanding of cause and effect from empirical evidence. David Hume, a philosopher and economist, is renowned for exploring causality, both as an epistemological puzzle and as a matter of practical concern in applied economics. In an article titled “Causality in Economics and Econometrics”, economics professor Kevin D. Hoover states, “economists inherited from Hume the sense that practical economics was essentially a causal science.” (Hoover, 2006). As a capital “E” Empiricist, Hume was a major influence on the development of causality in economics; his skepticism created a tension between the “epistemological status of causal relations and their role in practical policy.” (Hoover, 2006). I doubt that Hume, when he famously defined causation in “A Treatise of Human Nature” in 1739, could have foreseen the radical change wrought by the exponential progress of technological evolution. Nor could he have imagined our present-day reality, where deep learning is used to determine cause and effect.

Today, “causal science” is being driven by machine learning; however, it is still a nascent area and development has focused primarily on theory. In my last article about causal ML, I mentioned my optimism that advancements in theoretical econometrics can be applied to social research. Prior to that, in an earlier causality article, I presented a case study of how econometrics can be used for data science, relying on a paper by Alexei Abrahams titled, “Hard traveling” (Abrahams, 2021). In this article, I return to Abrahams’ study of the causal effect of Israeli checkpoints on Palestinian labour outcomes, but I utilize deep learning to tackle the instrumental variables. Fortuitously, Abrahams’ study presented an opportunity to apply a recent ML advancement in econometrics to socially-centered research.

In support of practicality, I build this article around a full Python code tutorial that implements the DeepIV framework (Hartford et al., 2017) supported by Microsoft Research’s EconML library. To start, I briefly revisit “Hard traveling” with a focus on highlighting the salient points of the data science study, allowing for a discussion of the limitations of the original approach. Following that, I suggest an alternate approach that extends Abrahams’ dataset to test for heterogeneous treatment effects. I support this alternative choice with a discussion of confoundedness, instrumental variables, and an explanation of the importance of determining the conditional average treatment effect (CATE). A short introduction to deep learning follows, where I situate this subset of ML within econometrics to explain the necessity of benchmarking deep learning results. Next, I provide a complete walkthrough of the DeepIV framework, which leads directly into a Python implementation of the technique on Abrahams’ dataset. To finish, I verify the deep learning results with CATE estimates from a shallow causal ML technique, viz. benchmarking the DeepIV estimates with an instrumented causal forest.

Hard traveling

I provide a detailed walkthrough of “Hard traveling” in an earlier article. The main goal of this impact evaluation was to investigate the causal impact of Israeli army checkpoints and road obstacles, deployed along the internal road network of the West Bank, on Palestinian unemployment rates.

Satellite image of the West Bank, Palestine circa December, 2007 from “Hard traveling: unemployment and road infrastructure in the shadow of political conflict” (Abrahams, 2021).

The obstacles seen in the image above were deployed for security reasons, but had the effect of disrupting Palestinian commuter travel. By preventing commuters from reaching commercial centers and border crossings, the obstacles caused employment losses; these losses, however, were substantially offset by employment gains among the commuters’ more centrally located Palestinian competitors. That is to say, marginal economic interventions, such as improving road infrastructure, served to alter the spatial distribution of unemployment, but did not reduce overall unemployment levels.

In “Hard traveling”, Abrahams uses an instrumented 2SLS (two stage least squares) first-differences strategy, where the instrument is the lengthwise proximity of Israeli settlements to Palestinian commuter routes. Rather than directly regressing the percent change in employment on the overall presence of checkpoints, the instruments isolate the subset of obstacles deployed within the immediate vicinity of settlements, which is what makes the method effective. The results of this study are shown in the image below, where the unemployment losses in the peripheral neighbourhoods stood in contrast to the employment gains in the central neighbourhoods.

Spatial histogram showing the main results of “Hard traveling”. The obstruction effects are higher in the peripheral rural areas and the protection effects are higher in the commercial centers. Source: Abrahams, 2021

Confoundedness and Instrumental variables

In my favourite econometrics book, “Mostly Harmless Econometrics”, the authors state, “the most important contemporary use of IV methods is to solve the problem of missing or unknown control variables” (Angrist & Pischke, 2009). When using observational data, there is a risk that latent unobservable variables will have a confounding effect, making it impossible to disentangle the relationships between the variables and determine cause and effect. This requires us to make explicit assumptions about the causal structure of the data generating process (DGP). Usually, this amounts to assuming that the treatment is conditionally independent of any latent variables given the observed covariates. Such an assumption, however, is not always defensible; in that case we need instrumental variables, which affect treatment assignment but have no direct effect on the outcome variable. Essentially, instrumental variables are necessitated when we cannot operate under the unconfoundedness assumption, and this crucial point is central to the design of Abrahams’ evaluation.

Abrahams conducted the research for “Hard traveling” between 2012 and 2014, and during that time, an instrumented 2SLS regression was the best option to determine causal effects. The limitation, however, is that this builds a linear model that assumes homogeneous treatment effects. The causal effect calculated in Abrahams’ study, therefore, can be referred to as the local average treatment effect (LATE), which has less statistical power due to the assumption of homogeneity. Importantly, assuming a constant causal effect obviates the possibility that the average treatment effect is conditional on covariates. Furthermore, a linear model may not be the best representation of the structural form of the DGP, hence the need for non-parametric IV methods. To deal with the two limiting assumptions of linearity and homogeneity, a non-parametric method should be used to estimate the conditional average treatment effects (CATE).

Why estimate the CATE?

In my last causal ML article, I provided a walkthrough of the conditional average treatment effect (CATE), and suggested that the importance of the CATE is related to equitable policy assignments. This is absolutely true; however, there are other reasons to look at treatment effect heterogeneity. Furthermore, the case study data in my last article was from a randomized control trial (RCT), which, similarly to A/B testing, allows for causal inference without interference from confounders. In “Hard traveling”, however, Abrahams uses observational data, and as seen in the image below, estimating the CATE is complicated by the potentially confounding effects of covariates.

Causal diagram of the conditional average treatment effect (CATE) in observational research settings. Image from Jacob (2021).

Beyond equity, for studies utilizing observational data, the CATE is also important for assessing the validity of a research design. According to Angrist & Pischke (2009), treatment effect heterogeneity is important because of the distinction between the two types of validity that characterize a research design. They explain that internal validity is whether a design uncovers causal effects, and a good IV study, like “Hard traveling”, has a strong claim to internal validity. External validity, on the other hand, is the predictive value of a study’s findings in a different context; in ML terms, this is analogous to the generalization ability of a model. Angrist & Pischke state, “An econometric framework with heterogeneous treatment effects helps us assess both the internal and external validity of IV estimates.” (Angrist & Pischke, 2009).

Regarding external validity, in the framing of “Hard traveling”, Abrahams suggests that absent political reform, economic interventions will not be able to effectively improve labour outcomes. Contextually, this is a bold and unconventional claim for an economist, but it is supported by a substantial political economy literature. Estimating the CATE, therefore, will help in an assessment of the external validity of Abrahams’ impact evaluation. In other words, understanding the heterogeneity of the treatment effects will help determine whether the results of the study can be extended to support a broader claim of the necessity of political reform. Although the CATE estimates would allow us to identify the specific Palestinian neighbourhoods that suffer the greatest negative causal effect, arguably, equitable policy assignment is less important than assessing the external validity of the evaluation.

Besides external validity, the CATE is useful for identifying subgroups for whom the average treatment effects will differ. Given Abrahams’ results, I suspected that there would be a difference in the CATE between the smaller peripheral neighbourhoods, and the larger more central neighbourhoods. To test this hypothesis, I used the open source desktop GIS software, QGIS, and the geojson data shared by Abrahams to separate the Palestinian neighbourhoods by population size.

Map of the West Bank, Palestine showing the small peripheral neighbourhoods (red) with population totals below the median and the larger neighbourhoods (blue) with population totals above the median. Image by author.

As seen in the map above, the smaller peripheral neighbourhoods (in red) are located at the terminal nodes of the road network structure, while the larger neighbourhoods (in blue) are more centrally situated within the road infrastructure. The Palestinian neighbourhoods are split along the median of the population totals for all the neighbourhoods, which amounted to a population size of 1885. This cleanly splits Abrahams’ dataset in half, and the equal sample sizes allowed for a comparison between the DeepIV CATE estimates for the subgroups.

Deep learning in Econometrics

Having established the necessity of estimating the CATE, the next step is to pick a non-parametric IV method to model the data generating process (DGP). Deep learning has proven to be a powerful method of learning latent representations of complex feature spaces, which makes it ideal for non-parametric modeling. Nonetheless, the use of deep learning in economics is a controversial subject, primarily because of the “black box” nature of deep neural networks. Arguably, econometrics demands a higher degree of rigour; therefore, the comparative lack of mathematical formality in our understanding of deep neural networks is a point of detraction for this type of ML. Moreover, economists Susan Athey and Guido W. Imbens mention that there is no formal proof for the supposed superiority of deep learning and neural networks over regression trees and random forests for supervised learning problems (Athey & Imbens, 2019).

The “deep” in deep learning simply refers to the multiple layers of the neural network, whereas a shallow network would have a single hidden layer. A “deep” discussion of this subset of ML is beyond the scope of this article; however, there are several excellent resources available that cover the subject. For an introduction, I highly recommend “Deep Learning with Python”, written by François Chollet, the creator of the popular deep learning API, Keras. I also suggest this helpful blogpost by Jason Brownlee on his blog, “Machine Learning Mastery”. There are also a slew of articles on Medium that provide introductions to the varied applications of this broad topic, including natural language processing, computer vision and others. In the code tutorial that follows, I use the beginner-friendly Keras, and provide simple explanations where necessary.

Regarding the healthy skepticism of deep learning within economics, I believe that while the lack of interpretability is an issue, it should not discount the use of this technique. A careful data scientist knows that for specific tasks, it is important to benchmark deep learning models against shallow, interpretable models; there is a very real possibility that the more interpretable model will perform better. Fortunately, causal forests can be used as a comparable non-parametric IV method to benchmark the results of deep learning IV estimation. Therefore, following the proposed deep learning implementation, I will walk through the code required to benchmark the results against the more interpretable causal forest. The next section introduces the DeepIV framework, which makes it possible to use deep learning with instrumental variables to estimate the CATE.

DeepIV framework

In “DeepIV: A Flexible Approach for Counterfactual Prediction”, Hartford et al. (2017) augment deep learning methods to characterize the causal relationship between treatment and outcome variables. The framework uses instrumental variables, which work as sources of treatment randomization that are conditionally independent from the outcomes. Understanding this causal relationship is a necessity for counterfactual prediction, and the DeepIV framework accomplishes this with two prediction tasks that can be solved with deep neural networks. Similarly to the two stages of a 2SLS, the first-stage is a network for treatment prediction and the second-stage network predicts outcomes.

As I explained in my previous causal ML article, counterfactual predictions cannot be made in a straightforward manner; however, since instrumental variables are as good as randomization, they allow for causal identification even in the presence of “selection on unobservables”. Hartford et al. refer to this setup as the “structure of the DGP under the IV specification”, and this causal relationship can be described by the causal graph below.

Causal graph of the DGP under the IV specification. Source: Hartford et al., 2017

In the diagram above, x represents the covariates or observable features, p represents the treatment (policy) variables, z represents the instruments, y is the outcome, and e represents the latent effect of unobservables. In this setup, the error term e influences y in an additive manner.

With the DeepIV framework, the IV analysis occurs over two supervised stages: the first stage models the conditional distribution of the treatment variable p given the instruments z and covariates x. Accordingly, the second stage uses a targeted loss function that involves integration over the first-stage conditional treatment distribution, a solution that simply requires adapting off-the-shelf algorithms. In both stages, a deep neural network is trained via stochastic gradient descent (SGD), and Hartford et al. present an out-of-sample causal validation procedure to select the hyperparameters of the models on a validation set.

Solving the challenge of counterfactual prediction requires an understanding of the relationship between the variables described above: y, p, x, z and e. As seen in the equation below, the structural form of y is determined by p, x and e.
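
In the notation of Hartford et al. (2017), this structural form can be written as:

y = g(p, x) + e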

Structural form of outcome variable y. Source: Hartford et al., 2017

The unknown function g is a potentially non-linear continuous function of both x and p, and the latent effect e (i.e. the error) affects y additively with unconditional mean 𝔼[e]. This gives us the structural equation h as the counterfactual prediction function. As seen below, the latent effect enters as 𝔼[e|x], such that it is only conditioned on x and remains constant under changes to the treatment p.
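
Written out, the counterfactual prediction function is:

h(p, x) ≡ g(p, x) + 𝔼[e|x]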

The counterfactual prediction function. Source: Hartford et al., 2017

Following the 2SLS procedure, to solve the above equation, h(p, x), we need to estimate an unbiased h_hat(p, x), which is possible in the presence of the instruments z. An unbiased estimate is possible because z satisfies three conditions: relevance, exclusion and unconfoundedness. Relevance means that the conditional treatment distribution F(p|x, z), a treatment density function, varies with z. Exclusion means that z does not enter the outcome equation directly; y depends only on the treatment p, the covariates x and the error e. Taking the conditional expectation of both sides of the structural equation, conditioned on (x, z), gives us the following equation:
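
Written out, this yields:

𝔼[y|x, z] = 𝔼[g(p, x)|x, z] + 𝔼[e|x] = ∫ h(p, x) dF(p|x, z)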

Conditional expectations of the structural form of outcome variable y, conditioned on covariates x, and instruments z. Source: Hartford et al., 2017

In the above equation, dF(p|x, z) is the conditional treatment distribution that needs to be estimated in the first stage. To estimate h_hat, the treatment density function F has to be replaced with an estimate F_hat. The difference between the DeepIV framework and 2SLS lies in avoiding an analytic solution to the integral in the above equation. Instead, the first stage estimates the treatment density function F_hat(p|x, z), modeled as a mixture of normal distributions (Gaussians), where the parameters of the mixture density model are the output of the first-stage network. As Hartford et al. explain, with enough mixture components the network can “approximate arbitrary smooth densities” (Hartford et al., 2017). With a 2SLS, linear models are built for both F_hat and the counterfactual prediction function h; however, as previously mentioned, this approach requires the two strong assumptions of linearity and homogeneity.

Given the output of the first-stage, the second stage optimizes the loss function shown below.
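
In the notation of Hartford et al. (2017), the loss being minimized over candidate functions h_hat is:

Σ_t ( y_t − ∫ h_hat(p, x_t) dF_hat(p|x_t, z_t) )²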

Loss function used to optimize the estimate of the structural equation h_hat. Source: Hartford et al., 2017

The second stage minimizes this 𝓵2 loss to optimize the estimate h_hat. In both stages, the hyperparameters are chosen to minimize the respective loss functions on a held-out validation set. This means that improvements in validation performance correspond to improvements on the true structural loss, which cannot be evaluated directly. With the DeepIV framework, causal validation is necessary, since cross-validation, the usual method for hyperparameter selection, cannot work when there are no samples of the counterfactual outcome. Essentially, using held-out validation data allows for an out-of-sample causal validation. Hartford et al. note that this method provides “relative” performance guidance, since it improves performance on counterfactual prediction problems without telling us how far the estimate of h is from the true value of h(p, x). The following section outlines the code required to implement the two stages of the DeepIV framework.

Python implementation of the DeepIV framework

Since it will be faster, I suggest using Google Colab to make use of the free GPU instance to train the deep neural networks. Additionally, Colab makes it easier to install the full version of EconML with the proper package dependencies, without running into issues with the TensorFlow backend. The “Hard traveling” data is available here as a Stata “.dta” file (Abrahams, 2021). The first step is to run the following line of code in Colab to install the full version of EconML.

!pip install econml[all]

The following two lines allow us to load Abrahams’ dataset in the Colab notebook.

from google.colab import files
files.upload()

The next step is to import the required packages, load the data, and do a little preprocessing. In the code snippet below, we rename the necessary columns, normalize the variables, and replace NaN values with 0.
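
The original snippet is not reproduced here, so the following is a minimal sketch of that preprocessing step; the file name and the renamed columns (the outcome, the two treatments and the two instruments) are placeholders, since the actual variable names in Abrahams’ .dta file may differ.

import pandas as pd

# Load Abrahams' dataset from the uploaded Stata file (file name is illustrative).
df = pd.read_stata("hard_traveling.dta")

# Rename the columns we need to shorter names (the original names are placeholders).
df = df.rename(columns={"d_employment_rate": "outcome",
                        "obstruction": "t_obstruction",
                        "protection": "t_protection",
                        "settler_proximity_1": "z1",
                        "settler_proximity_2": "z2"})

# Normalize the continuous variables and replace NaN values with 0.
num_cols = ["outcome", "t_obstruction", "t_protection", "z1", "z2"]
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
df = df.fillna(0)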

In “Hard traveling”, Abrahams uses 71 covariates in the main regression. Following suit, in the code snippet below we load the 71 controls as dummy variables.
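
A sketch of collecting the controls; the “ctrl_” prefix is purely illustrative, since the actual dummy column names come from Abrahams’ replication files.

# Collect the 71 control dummies; the "ctrl_" prefix is a placeholder for
# however the control columns are actually named in the dataset.
control_cols = [c for c in df.columns if c.startswith("ctrl_")]
assert len(control_cols) == 71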

Since we are testing the hypothesis that the subgroups experience different CATEs, in the code snippet below we divide the dataset in half using the median value of the population totals (median = 1885). Following that, separate variable (outcome, treatment, covariates, instruments) sets are defined for the two subgroups and converted to numpy arrays to fit the neural networks.
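
Roughly, the split and the variable sets look like the sketch below; the “population” column name is again a placeholder.

# Split the neighbourhoods at the median population total (1885).
median_pop = df["population"].median()
peripheral = df[df["population"] <= median_pop]
central = df[df["population"] > median_pop]

# Peripheral subgroup: outcome, treatments, covariates and instruments as numpy arrays.
y = peripheral["outcome"].values
t = peripheral[["t_obstruction", "t_protection"]].values
x = peripheral[control_cols].values
z = peripheral[["z1", "z2"]].values

# Central subgroup.
y2 = central["outcome"].values
t2 = central[["t_obstruction", "t_protection"]].values
x2 = central[control_cols].values
z2 = central[["z1", "z2"]].values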

Next, we build the deep neural models to be used as the first-stage treatment network and the second-stage outcome network. Both, in conjunction, will be used to build two separate estimation models for the two subgroups. As seen in the code snippet below, Keras is used to construct the neural networks by sequentially stacking the neural layers. For the first dense layer, in both the treatment and outcome models, the input shape is 73. For the treatment network, this equates to the 71 covariates plus the 2 instrumental variables, and for the outcome network, it equates to the 71 covariates plus the 2 treatment variables.

In both the treatment and outcome networks, there are three dense layers alternated with three dropout layers. However, the outcome network differs by having a fourth dense layer for a total of 7 layers. Each dense layer uses a rectified linear activation function or ReLU, and decreases in size by 50% (128, 64, 32) as the network increases in depth. The final seventh layer of the outcome network has an output size of 1. Each dropout layer drops units at a rate of 17%, which is a way of adding regularization to the network in order to avoid overfitting.
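
A sketch of the two architectures as described above, written against the Keras 2.x API that EconML’s DeepIV expects; I wrap each network in a small builder function (my own naming) so that each subgroup’s estimator later gets its own copy of the weights.

from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_treatment_model():
    # First-stage treatment network: three dense layers (128, 64, 32) with ReLU,
    # each followed by a dropout layer with a 17% drop rate.
    return Sequential([Dense(128, activation="relu", input_shape=(73,)),
                       Dropout(0.17),
                       Dense(64, activation="relu"),
                       Dropout(0.17),
                       Dense(32, activation="relu"),
                       Dropout(0.17)])

def build_response_model():
    # Second-stage outcome network: the same stack plus a final dense layer of size 1.
    return Sequential([Dense(128, activation="relu", input_shape=(73,)),
                       Dropout(0.17),
                       Dense(64, activation="relu"),
                       Dropout(0.17),
                       Dense(32, activation="relu"),
                       Dropout(0.17),
                       Dense(1)])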

Next, it is necessary to set the parameters for the two estimation models, one for the smaller peripheral neighbourhoods, “deepIvEst_per”, and the other for the larger central neighbourhoods, “deepIvEst_not_per”. As seen in the code snippet below, first we define two sets of Keras parameters for fitting the models, one with 50 epochs and the other with 100 epochs. Based on my fine-tuning, I found that for the “deepIvEst_per” model, the first-stage treatment network requires more than 50 epochs to minimize the loss. Conversely, the second-stage outcome network, and the two networks of the “deepIvEst_not_per” model, all require fewer than 50 epochs.

The validation splits in both parameter sets are set to 10%; this represents the hold-out data. For callbacks, we use Keras’ “EarlyStopping” with the “patience” set to 2, so that once the model starts to overfit, training ends and the best weights are restored. For early stopping, “patience” refers to the number of epochs with no improvement after which training will be stopped. This means that the number of epochs specified acts as a maximum threshold, not a mandatory target. For example, the treatment network for the “deepIvEst_not_per” model is set to 50 epochs, but for this network, training generally ends after 23 epochs.
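
A sketch of that setup, under a few assumptions: the import path is for recent EconML versions (older releases expose the same estimator as econml.deepiv.DeepIVEstimator), and n_samples=1 is an illustrative value, since the article does not pin it down.

import keras
from econml.iv.nnet import DeepIV  # older EconML: from econml.deepiv import DeepIVEstimator

# Keras fit options: 10% hold-out validation split and early stopping with patience=2.
stop = [keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)]
keras_fit_50 = {"epochs": 50, "validation_split": 0.1, "callbacks": stop}
keras_fit_100 = {"epochs": 100, "validation_split": 0.1, "callbacks": stop}

# Fresh networks for each subgroup, so the two estimators do not share weights.
m_per, h_per = build_treatment_model(), build_response_model()
m_cen, h_cen = build_treatment_model(), build_response_model()

# Estimator for the small peripheral neighbourhoods (treatment network: up to 100 epochs).
deepIvEst_per = DeepIV(n_components=15,  # mixture of 15 Gaussians
                       m=lambda z, x: m_per(keras.layers.concatenate([z, x])),
                       h=lambda t, x: h_per(keras.layers.concatenate([t, x])),
                       n_samples=1,
                       use_upper_bound_loss=True,
                       n_gradient_samples=0,
                       optimizer="adagrad",
                       first_stage_options=keras_fit_100,
                       second_stage_options=keras_fit_50)

# Estimator for the larger central neighbourhoods (both networks: up to 50 epochs).
deepIvEst_not_per = DeepIV(n_components=15,
                           m=lambda z, x: m_cen(keras.layers.concatenate([z, x])),
                           h=lambda t, x: h_cen(keras.layers.concatenate([t, x])),
                           n_samples=1,
                           use_upper_bound_loss=True,
                           n_gradient_samples=0,
                           optimizer="adagrad",
                           first_stage_options=keras_fit_50,
                           second_stage_options=keras_fit_50)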

As seen in the code snippet above, the parameters of the two estimation models for the two subgroups are nearly identical; the only difference is in the treatment network for the “deepIvEst_per” model, which is set to a maximum of 100 epochs. For both models, “n_components” is set to 15, meaning that the mixture density functions for both are composed of 15 Gaussian distributions. Parameters “m” and “h” are the treatment and outcome networks that were detailed earlier; a lambda function maps the variables, (z, x) for the treatment model or (t, x) for the outcome model, to the Keras models. The result is that “m” and “h” each receive their respective inputs concatenated into a single tensor. The “n_samples” parameter represents the number of samples that should be used to estimate the outcome.

The parameter “use_upper_bound_loss” is set to “True”. If set to “False” the gradient is estimated from two independent samples and this requires a forward pass through the network, therefore, it is “computationally intensive” (Hartford et al., 2017). Conversely, setting this parameter to “True” means that a single draw can be used to calculate the gradient, since this optimizes an upper bound on the loss function. The drawback is that the upper bound loss is only an approximation of the true loss. Hartford et al., however, claim that the upper bound loss had better performance “under practical computational limitations”. The “n_gradient_samples” is set to “0” since we are optimizing for the upper bound loss.

I tried a variety of optimizers, and selected the Adagrad optimizer since it has parameter-specific learning rates. This means that the learning rate is adapted relative to how frequently a parameter gets updated during training; essentially, the more updates a parameter receives, the smaller the size of its gradient updates. The “first_stage_options” and “second_stage_options” are where we pass the Keras parameters that were set earlier; these control how the model is fit. Training the two estimation models requires running the following two lines separately. Each subgroup model is fit with the variables described earlier, (y, t, x, z) and (y2, t2, x2, z2) for the “deepIvEst_per” model and the “deepIvEst_not_per” model, respectively.

deepIvEst_per.fit(Y=y, T=t, X=x, Z=z)
deepIvEst_not_per.fit(Y=y2, T=t2, X=x2, Z=z2)

Once the models have been trained, it is possible to predict the treatment effects for each neighbourhood. Each subgroup is given CATE estimates from their respective model. The results are stored in a dataframe, and the rolling mean is calculated for each set of treatment effects.
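
A sketch of that prediction step; the call to effect() uses the default baseline and target treatment levels, and the rolling window of 50 is illustrative.

# CATE estimates for each neighbourhood, from each subgroup's own model.
te_per = deepIvEst_per.effect(x)
te_cen = deepIvEst_not_per.effect(x2)

# Store the estimates and smooth them with a rolling mean for plotting.
results = pd.DataFrame({"cate_peripheral": pd.Series(te_per.flatten()),
                        "cate_central": pd.Series(te_cen.flatten())})
results["cate_peripheral_rolling"] = results["cate_peripheral"].rolling(50, min_periods=1).mean()
results["cate_central_rolling"] = results["cate_central"].rolling(50, min_periods=1).mean()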

Lastly, as seen in the code snippet below, for a comparison we plot the CATE for the two subgroups.
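
A minimal plotting sketch:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(results["cate_peripheral_rolling"], color="blue", label="Peripheral neighbourhoods")
plt.plot(results["cate_central_rolling"], color="purple", label="Central neighbourhoods")
plt.xlabel("Neighbourhood index")
plt.ylabel("Estimated CATE (rolling mean)")
plt.legend()
plt.show()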

I ran the training 20 times, collected the results and picked a random sample to plot, resulting in the following graph:

The CATE for the peripheral neighbourhoods (blue) and the CATE for the central neighbourhoods (purple). Image by author.

The median CATE for the small peripheral neighbourhoods ranged from 0.4 to 0.8, whereas the median for the larger central neighbourhoods ranged from 2.0 to 3.0. Additionally, for the lowest quantile of the peripheral neighbourhoods, the average CATE ranged from −0.3 to 0.3. What stands out, however, is the clear separation between the CATE values of the two subgroups, as shown in the graph above.

Causal forest benchmark

To benchmark the results from the DeepIV framework, for each subgroup we build an instrumented causal forest (Athey et al., 2019). In the code snippet below, we split Abrahams’ dataset into a train and test set for each subgroup. Then we set the outcome variable, treatment variables, covariates and instruments for fitting two causal forests, one for each subgroup.
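
A sketch of that split, reusing the placeholder column names from earlier; the 20% test fraction is illustrative.

from sklearn.model_selection import train_test_split

# Train/test split for each subgroup.
per_train, per_test = train_test_split(peripheral, test_size=0.2, random_state=42)
cen_train, cen_test = train_test_split(central, test_size=0.2, random_state=42)

# Outcome, treatments, covariates and instruments for the peripheral forest...
y_p, t_p = per_train["outcome"].values, per_train[["t_obstruction", "t_protection"]].values
x_p, z_p = per_train[control_cols].values, per_train[["z1", "z2"]].values

# ...and for the central forest.
y_c, t_c = cen_train["outcome"].values, cen_train[["t_obstruction", "t_protection"]].values
x_c, z_c = cen_train[control_cols].values, cen_train[["z1", "z2"]].values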

Next, we set the parameters for the two causal forests and fit the models. For details on how to choose the parameters of a causal forest, I suggest my previous causal ML article.
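
A sketch using EconML’s GRF-based instrumented forest; I am assuming the econml.grf.CausalIVForest class and its fit(X, T, y, Z=...) signature here, and the forest parameters are illustrative, so check the EconML documentation for your version.

from econml.grf import CausalIVForest

# One instrumented causal forest per subgroup (parameter values are illustrative).
cf_per = CausalIVForest(n_estimators=2000, min_samples_leaf=10, random_state=42)
cf_cen = CausalIVForest(n_estimators=2000, min_samples_leaf=10, random_state=42)

cf_per.fit(x_p, t_p, y_p, Z=z_p)
cf_cen.fit(x_c, t_c, y_c, Z=z_c)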

We use the causal forest models to predict treatment effects and the upper and lower bounds of the confidence intervals, for each subgroup. Following that, the results for each subgroup are stored in a dataframe and the rolling mean is calculated.
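
A sketch of the prediction step, assuming the GRF-style predict(X, interval=True, alpha=...) call returns point estimates with lower and upper bounds, one column per treatment variable; the rolling window of 25 is illustrative.

# Point estimates and 95% confidence intervals on the held-out test sets.
x_p_test = per_test[control_cols].values
x_c_test = cen_test[control_cols].values

te_p, lb_p, ub_p = cf_per.predict(x_p_test, interval=True, alpha=0.05)
te_c, lb_c, ub_c = cf_cen.predict(x_c_test, interval=True, alpha=0.05)

# Keep the first treatment column and smooth each series with a rolling mean.
cf_per_df = pd.DataFrame({"cate": te_p[:, 0], "lb": lb_p[:, 0],
                          "ub": ub_p[:, 0]}).rolling(25, min_periods=1).mean()
cf_cen_df = pd.DataFrame({"cate": te_c[:, 0], "lb": lb_c[:, 0],
                          "ub": ub_c[:, 0]}).rolling(25, min_periods=1).mean()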

Lastly, we plot the different treatment effects and confidence intervals from each causal forest.
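
A minimal plotting sketch for the forest estimates and their intervals:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

ax1.plot(cf_per_df["cate"], color="blue", label="Peripheral CATE")
ax1.fill_between(cf_per_df.index, cf_per_df["lb"], cf_per_df["ub"],
                 color="green", alpha=0.3, label="95% CI")
ax1.legend()

ax2.plot(cf_cen_df["cate"], color="purple", label="Central CATE")
ax2.fill_between(cf_cen_df.index, cf_cen_df["lb"], cf_cen_df["ub"],
                 color="green", alpha=0.3, label="95% CI")
ax2.legend()
plt.show()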

Plotting only the CATE for both subgroups results in the following graph:

The CATE for the peripheral neighbourhoods (blue) and the CATE for the central neighbourhoods (purple). Image by author.

Plotting the CATE with confidence intervals for the peripheral neighbourhoods subgroup results in the following graph:

The CATE for peripheral neighbourhoods (blue) with the upper and lower bound confidence intervals (green). Image by author.

Plotting the CATE with confidence intervals for the central neighbourhoods subgroup results in the following graph:

The CATE for central neighbourhoods (purple) with the upper and lower bound confidence intervals (green). Image by author.

Similarly to the DeepIV experiment, I ran the causal forests 20 times and collected the results. The median CATE for the small peripheral neighbourhoods ranged from 0.2 to 0.3, whereas the median for the larger central neighbourhoods ranged from 2.9 to 3.1. Interestingly, the average CATE for the lowest quantile of the peripheral neighbourhoods ranged from −3.5 to −2.8, a range that is much lower than the DeepIV estimates (−0.3 to 0.3). Also, the causal forests’ median CATE ranges were narrower than the ranges for the DeepIV estimates, suggesting greater variance in the latter’s estimates.

Final thoughts

When comparing the estimates from the DeepIV framework and the causal forests, I was reminded of Abrahams’ results in “Hard traveling”. The net attenuated effect was close to zero, since, essentially, the obstruction and protection effects cancelled each other out. The CATE estimates from both the DeepIV framework and the causal forests, on the other hand, differ from Abrahams’ local average treatment effect (LATE). Importantly, in both the DeepIV models and the benchmark causal forests, there is a clear difference between the CATE values of the subgroups, suggesting the hypothesized split was a valid separation. Since the results of the two methods corroborate each other, I have support for my hypothesis that the two subgroups experience different treatment effects.

Benchmarking the deep learning results with the more interpretable causal forest allows for a comparison of two non-parametric IV methods. The strength of causal forests lies in the ability to calculate valid confidence intervals. In comparison, the strength of deep learning is the ability to create latent representations of complex feature spaces; its power is that it helps us move beyond the curse of dimensionality. Considering that Abrahams’ main setup included 71 covariates, I believe this constitutes a high-dimensional feature space, hence the appeal of deep learning. Conversely, having valid confidence intervals is useful, or arguably necessary, for making policy decisions. The choice of estimation method, however, should depend on the purpose of the experiment and the size and complexity of the data. For policy assignments, I would be more inclined to utilize the causal forest estimates, since we have a measurement of the uncertainty of those estimates. The DeepIV framework, however, is ideal for large and complex datasets.

I welcome feedback and questions, feel free to connect with me on Linkedin.
