
How to Build a Causal Inference Model to Explore Whether Global Warming is Caused by Human Activity

How to use Python and the DoWhy library to build a causal inference machine learning model to explore the causes of global warming

Photo by Patrick Hendry on Unsplash

Introduction

I published an article recently to provide a simple tutorial of how to get up and running with Causal Inference using the Pgmpy library –

A Simple Explanation of Causal Inference in Python

One of the readers of the article pointed out that the simple nature of the data (for example vaccination = 1 or 0, disease = 1 or 0) meant that causal inference models could only be used to solve relatively simple problems and hence that they have limited applicability to real-world problems.

This prompted me to look for a BIG problem to try to solve using a causal inference model. I chose "Global Warming" and in particular climate denial for these reasons –

  • If I could develop a working model it would refute the proposal that causal inference is limited to solving simple problems.
  • There are datasets available (if you look hard enough!).
  • It is current and topical, with the recent United Nations COP27 Summit having been held in Sharm El-Sheikh in Egypt in November 2022.

Getting the Data

The key data sources used in the model are –

  • World Bank Data
  • Global Monitoring Laboratory
  • NASA

All of these data sources are available for public use. The full details including licenses and terms-of-use can be found in the "References" section at the end of the article.

Here is the combined climate data …

Image by Author

… and here is an explanation of the features …

Image by Author

Exploratory Data Analysis

The following section explores the data to identify and visualise the correlations.

I have left the code out to keep the article focused on the causal inference aspect but if you would like to view and run the code it is all available here – https://github.com/grahamharrison68/Public-Github/blob/master/Causal%20Inference/Climate%20Change%20Model-v3.ipynb.

Correlations

Let’s start by taking a quick look at the correlations between the main features …

Image by Author

It appears that there are some high correlations so let’s have a look at these relationships in more detail …

Population vs. Energy

Our first analysis is to explore the association between global population and energy consumption …

Image by Author

This comparison suggests an approximately linear relationship between world population and energy use, with a correlation of r=0.931.
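For readers who want to see where an r value like this comes from, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The numbers are made up purely for illustration and are not the article's dataset, so the result will not match the figure quoted above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical values for illustration only (NOT the article's data).
population = [3.0, 3.7, 4.4, 5.3, 6.1, 6.9, 7.8]
energy_use = [1.10, 1.35, 1.50, 1.62, 1.65, 1.85, 1.90]

r = pearson_r(population, energy_use)
print(f"r = {r:.3f}")
```

A value close to +1 means the two series rise together almost in a straight line, which is all that a correlation can tell us at this stage.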

Overall Energy vs. Fossil Fuel Consumption

Given the correlation between population and energy the next plot looks for any correlation between energy consumption and the specific consumption of fossil-fuels …

Image by Author

There is a near-perfect correlation between overall energy use and fossil fuel use which is hardly surprising but still useful to visualise.

Fossil Fuel Energy Consumption vs. CO2 Concentration

Following the forward chain, if there is a correlation between energy consumption and fossil-fuel use, what is the association between fossil fuel use and CO2 concentration levels? …

Image by Author

There is a strong correlation, but suspend any conclusions about cause-and-effect for now as this will be the subject of the next part of the article.

CO2 Concentration vs. Temperature Change

The last piece of the puzzle is to explore any potential correlation between CO2 concentration and global temperatures …

Image by Author

There is a correlation between these two features which might be reasonably expected, but to move from correlation to causation requires the development of a causal model …

Moving from Correlation to Causation

We know that the following correlative relationships exist –

  • Population increase correlates with increased energy consumption.
  • Increased energy consumption correlates with increased use of fossil fuels.
  • Fossil fuel use correlates with increased levels of CO2.
  • Increases in CO2 correlate with increases in temperature.

However, we also know the famous quotation attributed to Karl Pearson that "correlation does not imply causation"

(https://thedecisionlab.com/reference-guide/philosophy/correlation-vs-causation)

So what can we say, if anything, about moving beyond correlation to a reasonable assumption of causation between human activity (energy consumption) and global warming?

Having read "The Book of Why" by Judea Pearl and having studied many other sources it is a reasonable conclusion that one of the following must be true if a correlation is found –

  1. That the correlation does imply causation, or …
  2. That the correlation is caused by a "confounding" factor (more on this later).

… and based on additional reading and research I would like to add a third possibility –

  3. That the correlation is entirely spurious and purely co-incidental.

Let’s explore these three possibilities in reverse order …

Spurious Correlations

Consider the following which is my all-time favourite spurious correlation attributed to Tyler Vigen (see References section for attribution).

Spurious Correlations, https://www.tylervigen.com/spurious-correlations

Proof will be hard to come by but surely every fibre of common sense screams out that the statement "couples who want to stay married should eat less margarine" cannot be true. Hence the correlation in the graph is taken to be completely spurious, co-incidental and accidental.

Highly correlated but completely spurious relationships are difficult to refute in the real world, but they can be reasonably challenged by …

  • Asking domain experts.
  • Looking out for wildly different y-axes in the data.
  • Using confidence intervals.
  • Continuing to monitor the data to see if the correlation continues or diverges in the future.

The margarine and divorce correlation is so strong that a confidence interval alone would not rule it out, so in that example we would need to find and ask experts in dairy products and marriage guidance to prove or refute the hypothesis that consuming margarine causes divorce, or more likely just keep monitoring the trend and wait for the inevitable future divergence in the data.
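One of the checks listed above is a confidence interval. As a rough sketch, an approximate 95% interval for a correlation coefficient can be computed with the Fisher z-transformation. The inputs below (r = 0.99 over 10 observations) are hypothetical and chosen only to show the mechanics; the function name is my own.

```python
import math


def correlation_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r (Fisher z-transformation).

    z_crit=1.96 corresponds to roughly a 95% interval.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                  # transform r onto an unbounded scale
    se = 1.0 / math.sqrt(n - 3)        # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical example: a very strong correlation over ten data points.
low, high = correlation_confidence_interval(r=0.99, n=10)
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

An interval that sits well away from zero only says the correlation is unlikely to be sampling noise. It says nothing about whether the relationship is causal, which is why domain knowledge and continued monitoring are still needed.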

Confounding

To visualise the issue of confounding, let’s return to our model and look at the correlation between non-fossil fuel consumption and CO2 concentration …

Image by Author

On the face of it this is worrying and confusing. Non-fossil fuels are assumed to be nuclear power and renewables, neither of which should cause an increase of CO2 concentration which in turn could cause temperatures to increase (if a causal link can be established).

So how can it be that the more non-fossil fuel energy we use the more CO2 levels increase?

What is happening here is that increases in population cause more energy to be produced and consumed. As more energy is required this causes an increase in both fossil fuel and non-fossil fuel energy consumption.

The fossil-fuel energy consumption is causing the CO2 levels to rise (if we can prove our theories) but as energy use increases both fossil and non-fossil energy consumption are increasing.

Energy is having a "confounding effect" which is leading to the correlation between non-fossil fuels and CO2 emissions.

Confounding effects can be a difficult concept to understand but I think this example illustrates it very nicely.

In summary there is a correlation between non-fossil fuel energy consumption and CO2 emissions that is neither causal nor spurious. It is wholly caused and explained by the confounding effect of energy on both non-fossil and fossil fuel energy consumption.
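The confounding story can be checked with a small simulation. The sketch below uses entirely synthetic data with made-up coefficients: energy drives both fossil and non-fossil consumption, and only fossil consumption drives CO2. Non-fossil use still correlates strongly with CO2, but the association vanishes once energy is held fixed (measured here with a partial correlation, which is my choice of tool rather than anything used in the original analysis).

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


def partial_r(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 20_000

# Synthetic causal structure (all coefficients are invented for illustration):
# energy -> fossil, energy -> non_fossil, fossil -> co2. No non_fossil -> co2 link.
energy = [rng.gauss(0, 1) for _ in range(n)]
fossil = [e + rng.gauss(0, 0.3) for e in energy]
non_fossil = [0.5 * e + rng.gauss(0, 0.3) for e in energy]
co2 = [2.0 * f + rng.gauss(0, 0.5) for f in fossil]

raw = pearson_r(non_fossil, co2)
adjusted = partial_r(non_fossil, co2, energy)
print(f"non-fossil vs CO2, raw correlation:        {raw:.3f}")
print(f"non-fossil vs CO2, controlling for energy: {adjusted:.3f}")
```

The raw correlation is large even though non-fossil use has no effect on CO2 in this simulation, while the adjusted value is close to zero. That is the signature of a confounder.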

A Proposed Causal Model

The purpose of this article is to show that causal inference machine learning solutions can be used to build models to solve complex, large and meaningful real-world problems, hence I am going to put forward a proposal for the causal links that would be tested and improved by climate and energy experts in a real-world model.

The next step is going to use my DirectedAcyclicGraph class. I have left the source code out of the article to keep it more concise but here is the link to the full source in case you would like to run the code for yourself – https://gist.github.com/grahamharrison68/9733b0cd4db8e3e049d5be7fc17b7602.


Here is my proposed model …

Image by Author

… and these are the meanings of the links …

  • Population (POP) increases are causing an increase in energy consumption.
  • Energy usage (ENGY) is causing increases in both non-fossil and fossil fuels.
  • Use of fossil fuels (FFEC) is causing an increase in CO2 concentration (CO2C) levels.
  • Increases in CO2 concentration (CO2C) are causing an increase in global temperatures (TMPI).
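The links above can be written down as a plain list of edges and converted into a GML string, which is one of the graph formats DoWhy accepts. This is only a sketch and not the DirectedAcyclicGraph class mentioned earlier: the function names are my own, and NFEC is an assumed label for non-fossil fuel use since the article does not spell that code out.

```python
# Proposed causal links as (cause, effect) pairs.
CAUSAL_EDGES = [
    ("POP", "ENGY"),
    ("ENGY", "NFEC"),  # NFEC is an ASSUMED label for non-fossil fuel use
    ("ENGY", "FFEC"),
    ("FFEC", "CO2C"),
    ("CO2C", "TMPI"),
]


def edges_to_gml(edges):
    """Render a list of directed edges as a single-line GML graph string."""
    nodes = []
    for source, target in edges:
        for name in (source, target):
            if name not in nodes:  # keep first-seen order, no duplicates
                nodes.append(name)
    node_text = " ".join(f'node [ id "{n}" label "{n}" ]' for n in nodes)
    edge_text = " ".join(f'edge [ source "{s}" target "{t}" ]' for s, t in edges)
    return f"graph [ directed 1 {node_text} {edge_text} ]"


def is_acyclic(edges):
    """Return True if the directed graph has no cycles (Kahn's algorithm)."""
    nodes = {name for edge in edges for name in edge}
    indegree = {name: 0 for name in nodes}
    for _, target in edges:
        indegree[target] += 1
    ready = [name for name in nodes if indegree[name] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for source, target in edges:
            if source == node:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    return visited == len(nodes)


gml_graph = edges_to_gml(CAUSAL_EDGES)
print(gml_graph)
print("acyclic:", is_acyclic(CAUSAL_EDGES))
```

Checking for cycles is worthwhile because a causal graph with a loop (for example temperature feeding back into population) is no longer a directed acyclic graph and cannot be used as-is.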

… but what if there is more to it than that?

The Big Question: Could Something Natural be Causing Global Warming?

There is a strong case for human activity having a causal impact on global warming based on the existing body of scientific evidence but there are still many people who hold different views …

Climate change denial, or global warming denial, is denial, dismissal, or doubt that contradicts the scientific consensus on climate change, including the extent to which it is caused by humans, its effects on nature and human society, or the potential of adaptation to global warming by human actions.

(https://en.wikipedia.org/wiki/Climate_change_denial#:~:text=Climate%20change%20denial%2C%20or%20global,global%20warming%20by%20human%20actions)

It would be very useful if we could extend the model developed so far to address this question and to provide some causal inference based evidence to refute this view and to add further empirical evidence to the accepted scientific opinion that global warming is caused by human activity.

To explore the potential for alternative causes for global warming we need to return to the concept of confounding.

We know from our earlier example that energy consumption is causal for both fossil fuel and non-fossil fuel use and hence that energy is "confounding" non-fossil fuel use causing a correlation with CO2 concentration that is neither causal nor spurious.

What if there is a confounder that is causing CO2 levels to rise and temperatures to increase that we do not know about? That sounds unlikely but if it were true it would unpick the case that fossil fuel energy consumption is causing CO2 concentrations to rise and strengthen the view that natural causes are responsible.

If there is an unknown factor causing CO2 to increase and temperatures to increase then it is not currently represented in the proposed directed acyclic graph. Not only is it a confounder but it is an "unobserved confounder".

Therefore the solution requires the directed acyclic graph to be extended as follows with the addition of "U" for the potential "unobserved confounder" –

Image by Author

Believe it or not, causal calculus can be used to provide a reliable measure of the impact of increasing fossil-fuel energy use on temperature increase even if an unobserved confounder exists and is not captured in the data!

The maths is very complicated (and beyond the scope of this article) but fortunately some of the emerging set of Python causal inference libraries can solve for unobserved confounding.

My favourite light-weight Python library for causal inference is Pgmpy but it does not currently implement unobserved confounding. I have documented this issue and raised a ticket which is currently being looked at by the Pgmpy team. You can check the status of that ticket here – https://github.com/pgmpy/pgmpy/issues/1574 – which is scheduled for general release in version 0.1.21 on 31st December 2022.

In the mean-time a model is provided in this article using the DoWhy library which already solves for unobserved confounders…

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
   d            
───────(E[TMPI])
d[FFEC]         
Estimand assumption 1, Unconfoundedness: If U→{FFEC} and U→TMPI then P(TMPI|FFEC,,U) = P(TMPI|FFEC,)

### Estimand : 2
Estimand name: iv
Estimand expression:
 ⎡                               -1⎤
 ⎢   d          ⎛   d           ⎞  ⎥
E⎢───────(TMPI)⋅⎜───────([FFEC])⎟  ⎥
 ⎣d[ENGY]       ⎝d[ENGY]        ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→TMPI then ¬(U →→{ENGY})
Estimand assumption 2, Exclusion: If we remove {ENGY}→{FFEC}, then ¬({ENGY}→TMPI)

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

Explanation of the Code

The main part of the code is the single-line call which creates the CausalModel instance. The parameters are as follows –

  • data=df_climate_change[climate_features] is telling the model which DataFrame and which features to use to build the model.
  • graph=climate_dag.gml_graph is instructing the model to use the causal relationships that have been proposed and represented in the climate_dag instance of the DirectedAcyclicGraph.
  • The "treatment" is the thing we are interested in changing and understanding, in this instance the use of fossil fuels.
  • The "outcome" is the impact we are interested in i.e. as the "treatment" (use of fossil fuels) changes what is the impact on the "outcome" (global temperature increase)?

The way that DoWhy works is a little un-intuitive at first. Many other libraries would just provide the answer at this point but DoWhy produces an intermediary step called an "estimand".

To calculate the causal effect one of three types of "path" or route has to be found from the treatment to the outcome in the directed acyclic graph –

  1. A "backdoor" path.
  2. A "frontdoor" path.
  3. An "instrumental variable".

Note: a fully detailed explanation of these three categories will be the subject of a future article.

The output from print(climate_estimand) looks scary and it put me off using DoWhy for a long time but essentially most of that output can be ignored.

It is sufficient to note that in our model DoWhy has found both a backdoor and an instrumental variable (iv), but no front door. We could use either of the two that have been found but I have chosen to use "instrumental variable".
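To give some intuition for why an instrumental variable helps, here is a small simulation on synthetic data. It assumes a hidden variable U that pushes up both the treatment (fossil fuel use) and the outcome (temperature), with energy acting as the instrument. All coefficients are invented, and the ratio-of-covariances estimator used here is a simple stand-in for the "iv" estimand printed above rather than DoWhy's own implementation.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


TRUE_EFFECT = 0.5  # invented "true" effect of fossil fuel use on temperature

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 20_000

energy = [rng.gauss(0, 1) for _ in range(n)]  # instrument (observed)
hidden = [rng.gauss(0, 1) for _ in range(n)]  # U: used to generate data only
fossil = [z + u + rng.gauss(0, 0.5) for z, u in zip(energy, hidden)]
temperature = [
    TRUE_EFFECT * x + 2.0 * u + rng.gauss(0, 0.5)
    for x, u in zip(fossil, hidden)
]

# Naive regression slope of temperature on fossil: biased by the hidden U.
naive_estimate = covariance(fossil, temperature) / covariance(fossil, fossil)

# Instrumental variable estimate: (d TMPI / d ENGY) divided by (d FFEC / d ENGY).
# U is never used here, mirroring the fact that it is unobserved.
iv_estimate = covariance(energy, temperature) / covariance(energy, fossil)

print(f"true effect:    {TRUE_EFFECT}")
print(f"naive estimate: {naive_estimate:.3f}")
print(f"IV estimate:    {iv_estimate:.3f}")
```

The naive slope is pulled well away from the true effect because U moves fossil fuel use and temperature together, while the instrumental variable estimate lands close to the true value. This only works if the instrument is unrelated to U and affects the outcome solely through the treatment, which are exactly the two assumptions listed under the "iv" estimand.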

Again DoWhy is different to most other libraries in that the method selected has to be explicitly stated in the code; there is no way to ask DoWhy to automatically select a suitable method for us.

Fortunately the code for selecting the "instrumental variable" is very straight-forward –

To note, the method names for all 3 possibilities are as follows –

  1. Backdoor: method_name='backdoor.linear_regression'
  2. Frontdoor: method_name='frontdoor.two_stage_regression'
  3. Instrumental Variable: method_name='iv.instrumental_variable'
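Putting the pieces together, the overall flow might look something like the sketch below. This is an assumed reconstruction rather than the exact code behind this article: the function name and the METHOD_NAMES dictionary are my own, and proceed_when_unidentifiable=True is my choice to avoid an interactive prompt. The DoWhy calls themselves (CausalModel, identify_effect and estimate_effect) are part of the library's public API, and DoWhy is imported inside the function so that it is only required when the function is actually called.

```python
# Method names for the three identification strategies listed above.
METHOD_NAMES = {
    "backdoor": "backdoor.linear_regression",
    "frontdoor": "frontdoor.two_stage_regression",
    "iv": "iv.instrumental_variable",
}


def estimate_climate_effect(data, gml_graph, approach="iv"):
    """Identify and estimate the effect of FFEC on TMPI using DoWhy.

    data      -- pandas DataFrame containing the model features
    gml_graph -- the causal graph as a GML string
    approach  -- one of the keys of METHOD_NAMES
    Requires the dowhy package to be installed.
    """
    from dowhy import CausalModel  # imported lazily; only needed when called

    model = CausalModel(
        data=data,
        graph=gml_graph,
        treatment="FFEC",  # fossil fuel energy consumption
        outcome="TMPI",    # global temperature increase
    )
    # Intermediate step: work out which estimands the graph supports.
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    # Final step: estimate the effect with the explicitly chosen method.
    estimate = model.estimate_effect(estimand, method_name=METHOD_NAMES[approach])
    return estimand, estimate
```

With the names used in this article it would be called along the lines of climate_estimand, climate_estimate = estimate_climate_effect(df_climate_change[climate_features], climate_dag.gml_graph), after which climate_estimate.value holds the estimated effect.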

That just leaves us needing a simple helper function to print out the results in a more readable format and the model will be complete!
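The helper itself is not reproduced in this text, so here is a minimal sketch of what such a function might look like. The name describe_ate and its parameters are assumptions for illustration; the wording simply mirrors the output shown in this article.

```python
def describe_ate(ate, treatment, outcome, treatment_unit, outcome_unit):
    """Return a human-readable description of an average treatment effect."""
    lines = [
        "Average Treatment Effect (ATE):",
        f'For every unit change (1 unit = 1 x {treatment_unit}) in "{treatment}" ...',
        # The "+" format forces an explicit sign so the direction is obvious.
        f'"{outcome}" will change by {ate:+} (1 unit = 1 x {outcome_unit})',
    ]
    return "\n".join(lines)


report = describe_ate(
    ate=0.02322620230983087,  # the estimate reported in this article
    treatment="Fossil Fuel Energy Consumption",
    outcome="Global Temperature Increase",
    treatment_unit="Oil Equivalent per Capita (10K KG)",
    outcome_unit="Degrees C (Lowess Smoothing)",
)
print(report)
```

Keeping the units in the message matters here, because the size of the effect is meaningless unless the reader knows that one unit of treatment is 10,000 KG of oil equivalent per capita.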

Average Treatment Effect (ATE):
For every unit change (1 unit = 1 x Oil Equivalent per Capita (10K KG)) in "Fossil Fuel Energy Consumption" ...
"Global Temperature Increase" will change by +0.02322620230983087 (1 unit = 1 x Degrees C (Lowess Smoothing))

And there we have it!

For every 10,000KG increase per capita in fossil fuel energy consumption global temperatures will increase by 0.023 degrees Celsius and this causal increase holds even if there is an unobserved confounder acting on both CO2 and temperature!

Effectively this simple implementation of causal inference using Microsoft’s DoWhy library can be used to refute climate change denial and to provide an "average treatment effect" showing by how much temperature will increase as fossil fuel energy consumption increases!

Conclusion

One of my readers responded to an earlier article on causal inference that the current libraries may only be applicable to simple problems.

That prompted me to find a big problem to solve and they do not come any bigger than climate change!

This article has demonstrated that causal inference can tackle big problems and provide solutions to types of problems that are not solvable using more traditional machine learning approaches.

Specifically the model provided evidence that human activity, in the form of increases in fossil-fuel energy consumption, has a causal impact on increases in global temperature.

Also, and critically, causal inference techniques move beyond "predictive analytics", where the model tells us what might happen if the future broadly turns out like the past, towards "prescriptive analytics", where the model can tell us what might happen if we intervene and change things.


References

