Thoughts and Theory

Practical Python Causality: Econometrics for Data Science

A data science introduction to econometrics with the Python library DoWhy, including a detailed code walkthrough of a case-study causality paper

Haaya Naushan
Towards Data Science
18 min read · Apr 1, 2021

--

Satellite image of the West Bank, Palestine from “Hard traveling: unemployment and road infrastructure in the shadow of political conflict” (Abrahams, 2021)

Data scientists have a tendency to focus on descriptive and predictive analysis, but neglect causal analysis. Decision making, however, requires causal analysis, a fact well recognized by public health epidemiologists during this Covid-19 pandemic. Due to my background in biology, I had internalized the adage “correlation does not equal causation”, to such an extent that I studiously avoided all causal claims. Fortunately, my insatiable curiosity led me to the field of econometrics, which embraces causality and sets down a body of rigorous mathematics to facilitate causal analysis.

Recently, my interest in econometrics has been fueled by my regionally-focused consulting work on the Middle East and North Africa (MENA) with the World Bank. Specifically, a recently published paper in the journal Political Science Research and Methods, details a rare, causal impact evaluation of Israeli checkpoints on Palestinian employment outcomes (Abrahams, 2021). In “Hard traveling”, the author utilizes an instrumental variable in a cleverly designed experiment, highlighting the causal effects of road blockades deployed by the Israeli army during the ongoing occupation of Palestine. In this article, I explore causality through the lens of practical application, by replicating the results of this interesting paper in a practical Python tutorial.

Firstly, I make clear the distinction between machine learning and econometrics, a necessary step to convey the complexity and difficulty of causality. Next, I introduce the Python libraries used in this tutorial and discuss an econometric approach to causal analysis. Following that, I outline the experimental design described in the case-study paper, which flows directly into an intuitive walkthrough of the relevant equations. Lastly, using Python, I replicate the main results of the paper with a practical emphasis on code implementation.

Why Econometrics?

Most data scientists are comfortable with machine learning but rarely utilize econometrics. Nonetheless, both econometrics and machine learning rely on applying statistical methods to data in order to empirically model and solve problems, with the superficial distinction being the application of econometrics to economic data. The important distinction, however, is in the type of questions that are commonly asked in the two fields. Traditionally, machine learning has been primarily focused on prediction; most often the goal is to create a model that is capable of generalization, such that predictions can be made for new, unobserved data. Conversely, econometrics is mostly concerned with causality, the goal being to understand cause-effect relationships, often within the scope of policy evaluation, such that recommended actions can be supported with empirical evidence.

Causality requires divining causal relationships from existing data while accounting for uncertainty, and, like prediction, it relies on unproven assumptions that cannot be rooted in ground truth. Prediction asks "what will happen?"; causality asks "why does something happen?" and, more importantly, "why was the effect not caused by something else?" Causal questions therefore carry a much higher burden of proof, with the advantage that the derived insights are more certain, owing to the rigour involved in addressing confounding factors and uncertainty.

As much as I appreciate the utility of machine learning, prediction cannot answer every question, and more often than not, causal inference is a necessity to support real-life decision making. Moreover, I enjoy asking "Why?"; it is a human instinct to attribute effects to causes, despite the difficulty of answering causal questions satisfactorily. An accessible book, "Mostly Harmless Econometrics: An Empiricist's Companion" (Angrist and Pischke, 2009), encouraged me to incorporate econometric methods into my empirical work. In the preface the authors state,

“Anyone interested in using data to shape public policy or to promote public health must digest and use statistical results. Anyone interested in drawing useful inferences from data on people can be said to be an applied econometrician.”

I would argue that data scientists should ask causal questions, and econometrics methodology is a natural fit for applying a causal framework to data-driven research. In fact, last year the Harvard Data Science Initiative started a program to investigate causal inference for machine learning. The push for integrating causality into data science is mirrored by the slow acceptance of machine learning by economists, indicative of a two-way relationship. For example, the distinguished economists Susan Athey and Guido W. Imbens advocate for the adoption of machine learning methods in empirical economic work. In the 2019 edition of the Annual Review of Economics, in a paper aptly titled "Machine Learning Methods That Economists Should Know About", Athey and Imbens state,

“the methods developed in the ML literature have been particularly successful in big data settings, … For such settings, ML tools are becoming the standard across disciplines, so the economist’s toolkit needs to adapt accordingly while preserving the traditional strengths of applied econometrics.”

Testing for Causality

The process of causal analysis can be broken down into four stages: the first step is to model the causal question, the second to identify the estimand, the third to estimate the effect, and the fourth to refute the obtained estimate. For the sake of simplicity, in this article I will primarily focus on implementing the four stages of causal analysis with DoWhy, a Microsoft-developed Python library for causal inference. Additionally, in order to replicate the main results of the case-study paper, I also utilize the Python library linearmodels.

When modeling the causal question in the first stage, it is necessary to make the causal assumptions explicit, and DoWhy makes this possible with causal graphs. To borrow econometrics terminology, causal graphs are probabilistic graphical models that encode assumptions about the data generating process; essentially a causal graph encodes prior knowledge. For a brief introduction to causal graphs, I suggest this Medium article, and for a thorough discussion of causal graphs, I suggest “The Book of Why” by Turing Award winner, Judea Pearl and science writer, Dana Mackenzie.

As a computer scientist, Pearl is well known for his work in artificial intelligence and for developing Bayesian networks; however, it is his work on causality that has been the most useful for my research. In fact, the "do" in DoWhy comes from "do-calculus", a formal language that Pearl invented to discuss causality; this pedagogical paper by Robert R. Tucci provides an overview of the language, including proofs and rules. The details of do-calculus are beyond the scope of this article, but the salient point is that DoWhy adopts a Pearlian framework by using do-calculus to build causal graphs. Specifically, DoWhy relies on graph-based criteria and do-calculus for modeling assumptions and identifying a non-parametric causal effect.

Explicitly identifying assumptions is only one advantage of using DoWhy's causal framework; importantly, DoWhy also separates the identification of causal effects in the second stage from the estimation of the effect in the third stage. To quote the Microsoft Research blog post introducing the library,

“Identification of a causal effect involves making assumptions about the data-generating process and going from the counterfactual expressions to specifying a target estimand, while estimation is a purely statistical problem of estimating the target estimand from data.”

As seen in the diagram below, with the DoWhy framework, the identification stage is kept separate from the estimation stage.

Separation of the identification and estimation stages of causal analysis with the DoWhy library. Source.

The separation of the estimation stage allows for the implementation of estimation methods based on the potential-outcomes framework, which relies on counterfactual conditionals. In an arXiv paper introducing DoWhy (2020), the researchers credit their usage of potential outcomes in estimation methods to Guido W. Imbens and Donald B. Rubin (2015). Since the case-study paper makes use of an instrumental variable, the method of estimation in this tutorial will be a two-stage least squares (2SLS) regression, mirroring the strategy chosen by Abrahams. The final stage involves refuting the estimate obtained in the third stage. In "Hard traveling", Abrahams conducts several robustness checks, which are detailed in a technical appendix; for simplicity, in this article we make use of DoWhy's automated robustness checks.

Experimental design of “Hard traveling”

The author frames this impact evaluation against a division between political science theory and the economic literature; the latter suggests that unemployment in urban labour markets results from technical shortcomings (e.g., lack of infrastructure), while the former claims that political reform is a prerequisite for addressing unemployment. There are very few political science papers on this topic that employ causality, and none that investigate the causal impact of infrastructural interventions; therefore, "Hard traveling" fills a gap in the existing political science literature using econometrics.

At face value, the aim of this paper is to evaluate the causal impact on Palestinian unemployment rates of Israeli army checkpoints and road obstacles deployed along the internal road network of the West Bank. The obstacles were deployed for security reasons, but had the effect of disrupting Palestinian commuter travel. To paraphrase, the author argues that, Israeli obstacles prevented peri-urban Palestinian commuters from reaching commercial centers and border crossings, causing employment losses for the commuters. The losses, however, were substantially offset by employment gains among the commuters’ more centrally located Palestinian competitors. The paper makes a strong claim that marginal economic interventions will serve to alter the spatial distribution of unemployment, but will not reduce overall unemployment levels.

This study is possible because of "the confluence of spatio-temporally disaggregated data and a plausibly exogenous connectivity shock"; that is to say, the number of Israeli checkpoints sharply increased during the Second Intifada (2000–2004) — the exogenous shock. Additionally, the author acquired spatio-temporally disaggregated data by traveling twice to Palestine (owing to the on-site requirement for access) to collect neighbourhood-level census data from 1997 and 2007, which was augmented with satellite imagery, adding another spatial component to the dataset. Post-uprising, the erected obstacles remained in place, as seen in the map below from the end of 2007. The satellite image shows the dispersion of Israeli road obstacles in the West Bank of Palestine, and the solid red line represents the Israeli separation barrier — a 708 km wall erected at the outset of the second uprising (2000) that remains in place today, over two decades later.

December 2007, placement of Israeli road obstacles in the West Bank. Source: Abrahams, 2021

With the hard-won census data, the dependent variable in this study is the change in employment outcomes for 480 Palestinian neighbourhoods, between 1997 and 2007; hereafter referred to as: % change in employment. These 480 neighbourhoods can be clustered into 310 supra-neighbourhoods as determined by World Bank/PCBS poverty clusters.

An instrumented 2SLS first-differences strategy is applied to test for causal effects of obstacles on unemployment, where the instrument is the lengthwise proximity of Israeli settlements to Palestinian commuter routes. This instrument fulfills the three conditions that define a valid instrument, which allows for a quasi-randomization of the "blockadedness" of Palestinian neighbourhoods. The lengthwise proximity to Israeli settlements should affect the placement of Israeli checkpoints, since the checkpoints' purported purpose is to defend the settlements. This settlement proximity will have an indirect effect on the % change in employment by virtue of the connection to the placement of Israeli checkpoints. Confounding between the lengthwise proximity and the % change in employment is tested by refuting the causal estimate with robustness checks.

This approach is effective because the instruments isolate the subset of obstacles that are deployed within the immediate vicinity of settlements, rather than directly regressing the % change in employment on the overall presence of checkpoints. In this study, Abrahams uses a 2SLS regression because, once the control variables are accounted for, the instruments are correlated with the endogenous treatment variables in the first-stage regressions, while remaining uncorrelated with the error term of the second-stage (outcome) equation.
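For readers less familiar with instrumental variables, the three textbook conditions alluded to above can be sketched as follows, for a generic outcome y, endogenous treatment x, and instrument z (this is standard notation, not the paper's):

```latex
y = \beta x + \varepsilon
% 1. Relevance: the instrument actually moves the treatment
\operatorname{Cov}(z, x) \neq 0
% 2. Exogeneity: the instrument is as good as randomly assigned
\operatorname{Cov}(z, \varepsilon) = 0
% 3. Exclusion: z affects y only through its effect on x,
%    not via any other causal channel
```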

The independent variables are labeled obstruction and protection, where the former relates to the direct obstructive effects of the obstacles and the latter represents the obstacles’ indirect protective effects. Accordingly, the instrumental variables are referred to as iv_obstruction and iv_protection, representing the lengthwise proximity to Israeli settlements. Histograms of the dependent, independent and instrumental variables used in this study are shown in the image below, taken from the paper’s technical appendix that can be accessed here.

Histograms of the dependent, independent and instrumental variables. Source: Abrahams, 2021

It is clear that there is healthy variation across all variables, which implies that the regression results are unlikely to be driven by outliers. Effectively, the regression shows countervailing effects, where the obstruction effects are balanced by the protective effects such that the attenuated net effects will be close to zero. As seen in the paper excerpt below, the author tests three effects for three hypotheses, an obstruction effect, a protection effect and an attenuated net effect.

The three hypotheses stated in “Hard traveling”. Source: Abrahams, 2021

The economic literature on urban labour markets situates unemployment as a byproduct of poor infrastructure, providing support to economic policy recommendations that focus on improving urban infrastructure. Accordingly, transit infrastructure shocks, such as road obstacles, should impact the supply of labour from residential neighbourhoods to commercial centers. The residential or origin neighbourhood is denoted by j, and k represents the destination neighbourhood, i.e., the commercial center where there are employment opportunities.

As mentioned, the independent variables, obstruction and protection, describe the direct and indirect effects of the obstacles, respectively. Abrahams generates two treatment variables, ∆obstructionⱼ and ∆protectionⱼ, where the first quantifies the degree to which Palestinian labourers from neighbourhood j are blockaded from accessing jobs in neighbourhood k, and the second quantifies the reverse protective effect on neighbourhood k due to the obstacles decreasing the labour flow competition from neighbourhood j.

Spatial histograms of the West Bank, with percentile distributions of the independent variables: obstruction on the left map and protection on the right map. Source: Abrahams, 2021

The ∆protectionⱼ variable is defined in the equation below, where dₖⱼ represents the road distance between (k, j), and num_obstaclesₖⱼ represents the number of obstacles on path dₖⱼ. Neighbourhood k’s pre-uprising share of the economy’s labour force is represented by nₖ, and there are 480 neighbourhoods as units of observation.

∆protectionⱼ as the average number of obstacles on travel path (k, j), weighted by nₖ and inverse weighted by dₖⱼ. Image adapted by author from Abrahams, 2021.

There is an additional obstruction effect from the obstacles deployed on paths to the Israeli border, especially since pre-uprising, 21.6% of Palestinian labourers traveled daily to Israel for work. To calculate ∆obstructionⱼ, Abrahams first calculates a ∆obstruction_naiveⱼ which is the average number of obstacles on the path between (j, k), inverse weighted by path dⱼₖ and weighted by a variable mₖ; where mₖ represents neighbourhood k’s pre-uprising share of the economy’s business. The more important a destination k is, the more harmful the obstructive effect, therefore mₖ adjusts for the relative importance of neighbourhood k as determined by business activity.

As seen in the image below, the share of Israel-bound labourers is used to weight the number of obstacles between the residential neighbourhood j and the border of Israel, whereas ∆obstruction_naiveⱼ is weighted by the remainder of the labour force.

Equations to calculate ∆obstructionⱼ, which also accounts for the percentage of the Palestinian labour force who were employed in Israel, as recorded in the 1997 census. Image adapted by author from Abrahams, 2021.

Notably, in order to calculate mₖ, Abrahams utilized radiance-calibrated nighttime satellite imagery to estimate the business activity of a neighbourhood k. However, the nighttime satellite images were blurry, so business activity in nearby Israeli settlements artificially inflated the calculated luminescence of Palestinian neighbourhoods. Resourcefully, Abrahams invented a method of deblurring the nighttime satellite images, so as to more accurately calculate the business activity of a Palestinian neighbourhood.

Next, the instruments ∆iv_obstructionⱼ and ∆iv_protectionⱼ follow the format of the treatment variables, where a ∆iv_obstruction_naiveⱼ is calculated first and ∆iv_obstructionⱼ is weighted by the percent of the Palestinian labour force that works in Israel. With the instruments, however, the effect is not calculated by the number of obstacles, but rather the lengthwise proximity of an Israeli settlement to a commuter path. By assuming a buffer zone around settlements, the proximity is calculated via the total length of the road segments on a commuter path that fall within the buffer zone of a settlement, such that it is representative of potential positioning of a road obstacle. In the image below, the instruments are weighted by mₖ and nₖ, similarly to their treatment counterparts.

Equations to calculate the instruments, which also account for the percentage of the Palestinian labour force who were employed in Israel, as recorded in the 1997 census. Image adapted by author from Abrahams, 2021.

In the following section, I build a causal graph to model the question and introduce the controls, i.e., the dummy variables.

First Stage: Causal Model

As a first step, the dataset for this tutorial can be accessed here. I suggest using a Google Colab notebook when working with DoWhy, since the graph visualization library under the hood, pygraphviz, can be tricky to install in a local environment. The following two commands run in Colab will correctly install pygraphviz and DoWhy.

!apt install libgraphviz-dev
!pip install pygraphviz dowhy

If using Colab, use the following two lines to upload the "Hard traveling" dataset, which is stored as a ".dta" (Stata) file.

from google.colab import files
files.upload()

In the code snippet below, we import the libraries needed and use pandas to read the Stata file into a dataframe. Columns in the loaded dataframe are renamed from the satellite image-tagged data columns to match the variables described in the previous section. Next, it is necessary to normalize the independent and instrumental variables to facilitate a comparison of the effects.
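As a sketch of this step, the following snippet z-score normalizes the treatment and instrument columns; a tiny synthetic dataframe stands in for the census data, and the commented read_stata call marks where the real file would be loaded (the filename is hypothetical):

```python
import numpy as np
import pandas as pd

# In the real notebook this would be: df = pd.read_stata("<dataset>.dta")
# Here, a tiny synthetic frame stands in for the census data.
rng = np.random.default_rng(0)
n = 8
df = pd.DataFrame({
    "chng_employment": rng.normal(size=n),
    "obstruction": rng.normal(2.0, 1.0, size=n),
    "protection": rng.normal(5.0, 2.0, size=n),
    "iv_obstruction": rng.normal(size=n),
    "iv_protection": rng.normal(size=n),
})

# Normalize (z-score) the treatments and instruments so that their
# estimated effects are on a comparable scale
cols = ["obstruction", "protection", "iv_obstruction", "iv_protection"]
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()
```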

Next, we add in all the necessary dummy variables to replicate the paper’s main results. The governorate dummies control for governorate-level employment trends for the 480 neighbourhoods. The remainder of the dummies represent neighbourhood-level covariates, such as earth mounds and partial checkpoints, which are excluded from the regression in order to reduce attenuation bias, since these “obstacles” did not interfere with passing traffic. The settle_dummies are related to the exclusion condition of the instruments which is discussed in the second stage. All the dummy variables are collected into a single list, all_dummies.
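A sketch of how the dummies might be assembled is shown below; the column names here are illustrative stand-ins, not the dataset's actual labels:

```python
import pandas as pd

# Illustrative stand-in data: governorate labels plus two
# neighbourhood-level covariates (hypothetical column names)
df = pd.DataFrame({
    "governorate": ["Nablus", "Hebron", "Jenin", "Nablus", "Hebron", "Jenin"],
    "settle_dummy_5km": [1, 0, 1, 0, 0, 1],   # settlement-proximity indicator
    "earth_mound": [0, 1, 0, 0, 1, 0],        # obstacle-type covariate
})

# One-hot encode the governorates to control for governorate-level trends
gov_dummies = pd.get_dummies(df["governorate"], prefix="gov", drop_first=True)
df = pd.concat([df, gov_dummies], axis=1)

# Collect every control into a single list for the causal model
settle_dummies = ["settle_dummy_5km"]
covariate_dummies = ["earth_mound"]
all_dummies = list(gov_dummies.columns) + settle_dummies + covariate_dummies
```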

DoWhy makes it very simple to build a causal graph. The dataframe is passed to the imported CausalModel, along with the treatment variables (∆obstruction and ∆protection), the outcome (dependent) variable (the change in employment), and the instrumental variables (∆iv_obstruction and ∆iv_protection). The "common_causes" argument is where we pass all_dummies; this allows us to control for the governorate-level trends and the neighbourhood-level covariates.

DoWhy produces the following causal graph to show the relationships between the independent, dependent and instrumental variables. In the image below, I have excluded the governorate-level trends and the neighbourhood-level covariates since they clutter the diagram. I simply ran the CausalModel without the common causes argument to produce the diagram, but all_dummies are included in the causal model for the rest of the tutorial.

Causal graph showing the relationship between the instruments, the independent variables, and the dependent variable: the change in employment between 1997 and 2007.

Second Stage: Identify the estimand

DoWhy makes it easy to identify estimands from the causal graph. The following command describes the model we have created:

model.interpret()

The output should read “Model to find the causal effect of treatment [‘obstruction’, ‘protection’] on outcome [‘chng_employment’]”. To identify the estimands from the model, we run the following two lines:

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

As seen below, the output of “identify_effect” will include the instrumental variables we specified when building the model.

Included in the output are the estimand assumptions of the instruments; here the exclusion condition of the instruments is stated. Abrahams defines this as, "the proximity of Israeli settlements to Palestinian travel routes, while evidently predictive of subsequent blockadedness, should not have affected the evolution of Palestinian neighbourhoods' employment rates via any other causal channel". In the fourth stage we use robustness checks to show that the settle_dummies, which are indicative of a settlement's proximity to a Palestinian neighbourhood (which could precipitate violence), represent an alternative channel that operates orthogonally to the instruments and therefore has no impact on the regression results. The careful inclusion of settle_dummies means that we are accounting for the exclusion condition, which allows us to proceed with the causal analysis without violating the estimand assumptions identified in this second stage.

Third Stage: Estimate the effect

The next step is to estimate the causal effect using the identified estimand, and as mentioned previously, Abrahams uses 2SLS as the estimation method. In the first stage of the 2SLS, as seen in the image below, instrumental substitutions of the treatment variables are created from their regression on the instruments, while accounting for the governorate-level trends and the neighbourhood-level covariates.

First stage of 2SLS, the instruments (∆iv_obstructionⱼ and ∆iv_protectionⱼ) are used to regress ∆obstructionⱼ and ∆protectionⱼ. Image adapted by author from Abrahams, 2021.

In the second stage, as seen in the image below, the intermediary variables (hatted ∆obstructionⱼ and ∆protectionⱼ) estimated in the first stage, are used to regress the change in employment, along with the governorate-level trends and the neighbourhood-level covariates.

Second stage of 2SLS, the variables estimated in the first stage are used to regress the change in employment from 1997 to 2007. Image adapted by author from Abrahams, 2021.

DoWhy has a built-in instrumental variable 2SLS method that we can use to quickly regress the change in employment on the treatment variables in two stages. The identified estimand is estimated with the DoWhy method named "iv.instrumental_variable", which is built on statsmodels' IV2SLS.

The output of the above code snippet is shown below; note that the standard errors are not heteroskedasticity-robust, since the underlying statsmodels function does not allow for specifying the type of standard error for the covariance estimator.

Results of an instrumented 2SLS for regressing the change in employment on the treatment variables. Image by author.

As seen, there is a negative causal effect associated with obstruction (-2.38) and a positive causal effect associated with protection (+3.79). The t-statistics are large enough to indicate a significant difference between the sample data and the null hypothesis. Below, the results of the main regression from "Hard traveling" are spatially displayed on a map of the West Bank; the warm colours represent the negative obstruction effect and the cool colours represent the positive protection effect.

Spatial histogram showing the results of the main regression, the obstruction effects are higher in rural areas and the protection effects are higher in the commercial centers. Source: Abrahams, 2021

The DoWhy results are similar, but they do not match Abrahams’ main results, since the regression was not weighted by the labour force from the 2007 census. In setting up the observational experiment, the labour force from the 1997 census is accounted for, however, to replicate the published results from “Hard traveling”, the 2SLS regression needs to be weighted by the variable labeled in the dataframe as: “lf_1_2007”. Unfortunately, DoWhy does not allow for weighted 2SLS regressions, due to the library’s dependence on statsmodels. Additionally, the 480 neighbourhoods can be clustered into 310 supra-neighbourhoods, and the statsmodels’ implementation lacks the flexibility to adjust for clusters. Therefore, I had three reasons to use the Python library linearmodels: to test for robustness to heteroskedasticity, to add a 2007 labour force weighting factor, and to cluster by the 310 supra-neighbourhoods.

With linearmodels, I firstly implemented a weighted OLS, secondly, a weighted 2SLS regression and thirdly, a clustered (and weighted) 2SLS regression. The results of the regressions are compared below and I have shared the full code of the regressions in a Jupyter notebook hosted on my Github.

Comparison of results for OLS, 2SLS and clustered 2SLS regressions with linearmodels. Image by author.

For the OLS and instrumented 2SLS regressions, the default covariance estimator is "robust", producing results that are robust to heteroskedasticity and mirroring the main results from "Hard traveling". For the clustered 2SLS regression, the 480 neighbourhoods are grouped into 310 clusters and the covariance estimator is "clustered". As seen in the table below, taken from "Hard traveling", the main results for the weighted 2SLS with controls are -3.75 for obstruction and +3.50 for protection, which is matched by my linearmodels result table above.

Main results table from “Hard traveling”. Source: Abrahams, 2021

Fourth Stage: Refute the estimate

The final stage is to refute the obtained estimate with robustness checks, and it is at this stage that the model is tested for robustness to confounders. In “Hard traveling”, there are extensive robustness checks, for example, fatalities data was added to the regressions to test for violence-induced measurement error. When loading the data for this tutorial, I realized the extensiveness of the robustness checks; the dataframe has 734 columns for 480 observations! I would need an additional post to cover the details of these checks, so instead I focus on general types of robustness checks made available by DoWhy that can be used to refute an estimate.

I tested three types of DoWhy refuters: a "bootstrap_refuter", a "data_subset_refuter", and a "random_common_cause" refuter. The "bootstrap_refuter" refutes an estimate by running it on a random sample of the data containing measurement error in the confounders. The "data_subset_refuter" refutes an estimate by rerunning it on a random subset of the original data. The "random_common_cause" refuter refutes an estimate by introducing a randomly generated confounder that may have been unobserved. The code snippet below is adapted from a DoWhy example notebook, which details how to iterate over multiple refuters.

The following image shows the results of running the multiple refuters on the DoWhy causal estimate obtained in the third stage.

Refutation results of robustness checks with DoWhy. Image by author.

Final Thoughts

This Python tutorial for causal analysis was intended to showcase the usefulness of econometrics, and to encourage other data scientists to incorporate causality into their empirical work. Using “Hard traveling” as a case-study paper was a wonderfully engaging learning experience, it added the necessary context required to develop an appreciation for applied econometrics. The intersection of data science and econometrics for causality is an area I intend to explore further, and my hope is to translate my learning journey into accessible tutorials.

As humans, we naturally think in causal terms, and bringing this type of thinking to machine learning feels like a natural extension. Prominent AI researchers, such as Yoshua Bengio, have advocated for extending causality to machine learning; interestingly, Bengio recently proposed causal learning as a way to solve the model generalization problem. Data science aside, I believe that the better machine learning becomes at answering causal questions, the more useful it will be for economics. I will explore this reverse angle in a future article where I estimate a conditional average treatment effect (CATE) using Keras and EconML, solving an economics question with machine learning. Lastly, many thanks to Alexei Abrahams for answering questions and providing thoughtful feedback.

For questions or comments, connect with me on Linkedin, I am interested in hearing how others use econometrics for data science!
