CO2 emissions in the USA: a statistical analysis using Python

Extracting information from CO2 emissions dataset

Piero Paialunga
Towards Data Science

--

Photo by Marek Piwnicki on Unsplash

Disclaimer: This notebook has not been written by a climate scientist! Everything is analyzed exclusively from a data scientist's point of view. All the statistical analyses are meant to be used as tools for a time series analysis of any kind.

Let’s start by stating the obvious:

The job of a data scientist is to extract insights.

The complexity of the tool you are using is not really relevant. What matters much more is that whatever you are using is suited to the kind of analysis you want to perform.

1. Introduction

In our case, we have a time series. This time series has, of course, time on the x axis (specifically, years recorded on a monthly basis) and, on the y axis, the carbon dioxide emissions generated by burning coal to produce electric power in the United States of America.

This dataset is publicly available and has no copyright limitations (CC0: Public Domain). It can be downloaded for free on Kaggle, here.

Let’s give it a look. First, let’s import the libraries:
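The original import cell is not shown; a typical set of libraries for this kind of analysis (my assumption, not the author's exact cell) would be:

```python
# Core libraries: numerics, data handling, and plotting.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
```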

Then, let’s plot the time series (the last 7 entries are from 2016, but it is a truncated year, so we cut them off):
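A sketch of the loading-and-plotting step. The file name and the "Value" column are assumptions about the Kaggle CSV, and a synthetic stand-in series is generated when the file is not present, just so the snippet stays runnable:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

N = 12 * (2015 - 1973 + 1)  # 516 monthly points, Jan 1973 - Dec 2015

try:
    # File name and column name are assumptions about the Kaggle CSV.
    co2 = pd.read_csv("co2_coal_electricity.csv")["Value"].to_numpy()
except FileNotFoundError:
    # Synthetic stand-in with a similar shape (slow rise, decline,
    # yearly cycle) so the snippet runs without the dataset.
    m = np.arange(N + 7)
    co2 = 60 + 40 * np.sin(np.pi * m / (N + 7)) + 8 * np.sin(2 * np.pi * m / 12)

co2 = co2[:N]                      # cut off the truncated 2016 entries
years = 1973 + np.arange(N) / 12   # fractional-year axis for plotting

plt.plot(years, co2)
plt.xlabel("Year")
plt.ylabel("CO2 emissions from coal electricity")
plt.savefig("co2_timeseries.png")
```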

The first data point is the first month of 1973 and the last data point is the last month of 2015. There are no missing values, so the length of the dataset is exactly 12 x (2015 - 1973 + 1) = 516.

Now, what kind of analysis is relevant in this case? Well, I think it is natural to ask three questions:

A. Is the CO2 emission generally increasing? What is the general trend?

B. Is there any recurrent pattern of this CO2 time series?

C. Is the time series “stable”, or can we expect unpleasant surprises?

Let’s investigate it, step by step:

2. General Trend

We are interested in profiling the general trend of this time series. It’s not really a regression technique, because we are using the whole time series, and it’s not really a forecasting task, because… well, because we are not forecasting at all :)

It is just a profiling of the trend, useful for modeling the time series in a general way. The code can easily be changed to use more sophisticated regression techniques and/or to split the dataset into a training set and a test set if you really want predictive ability, which is not our goal at this time.

What we expect our model to capture is a massive increase (about 3x) from the starting point to the peak, followed by a return to more moderate values, still around 2x the starting values.

A reasonable model (not too complicated, not too lossy) can be obtained using at least a third-degree polynomial (the curve doesn’t look like a parabola, and it is definitely not a line :D). In my opinion degree = 4 is better, because it captures more details, but it doesn’t really make much of a difference. (Again, if you want to be more rigorous, fit the following approach on a smaller portion of the dataset, and use a validation set to pick the model with the lowest error.)

So let’s fit our fourth degree model. Two steps:

A. Fitting:

B. Predicting + Plotting:
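The two steps can be sketched together with numpy's `polyfit`/`polyval`. An illustrative stand-in series replaces the real Kaggle data here; rescaling time to [0, 1] keeps the least-squares fit well-conditioned:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Illustrative stand-in for the monthly series: slow rise then decline,
# plus a yearly cycle (not the real data).
N = 12 * (2015 - 1973 + 1)
t = np.arange(N)
co2 = 60 + 40 * np.sin(np.pi * t / N) + 8 * np.sin(2 * np.pi * t / 12)

# A. Fitting: a degree-4 polynomial on rescaled time.
x = t / N
coeffs = np.polyfit(x, co2, deg=4)

# B. Predicting + plotting: evaluate the polynomial over the whole series.
trend = np.polyval(coeffs, x)

years = 1973 + t / 12
plt.plot(years, co2, label="data")
plt.plot(years, trend, label="degree-4 trend")
plt.legend()
plt.savefig("trend_fit.png")
```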

This looks exactly like we expected:

  1. Monotonically increasing until around 2002
  2. Monotonically (much quicker) decreasing from 2002 to 2015

We don’t have much more information about the specific source of the data, but we can say that there has been a decrease in emissions after a long period of increase. The increase looks far more gradual (maybe even more natural, tied to factory rhythms) and slower than the decrease. A guess could be that someone intentionally started to reduce emissions (perhaps through some kind of regulation) after dangerous emission levels had been reached (I don’t really know; it’s just a plausible guess).

3. Recurrent Patterns

The considerations we made in the previous chapter are interesting, but they should not interfere with what we are going to do next: let’s detrend our time series.
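Detrending is just subtracting the fitted polynomial from the series. A minimal sketch, on the same illustrative stand-in data:

```python
import numpy as np

# Same illustrative stand-in series used for the trend fit.
N = 12 * (2015 - 1973 + 1)
t = np.arange(N)
co2 = 60 + 40 * np.sin(np.pi * t / N) + 8 * np.sin(2 * np.pi * t / 12)

# Subtract the degree-4 trend; what is left are the oscillations.
x = t / N  # rescaled time, for a well-conditioned fit
trend = np.polyval(np.polyfit(x, co2, deg=4), x)
detrended = co2 - trend
```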

This is how it looks now:

Now the time series clearly oscillates up and down, as you can see.
Let’s analyze these ups and downs using a very powerful tool: the magic Fourier transform.

I ❤ the Fourier transform, and I have used it many times in my life. In particular, I talked about it a couple of blog posts ago, here. There you can find the idea behind the Fourier transform and why it is useful, and hopefully you will come to love it as much as I do :)

Let’s prepare the data:

And let’s plot the Fourier transform (amplitude vs. period, in months):
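A sketch of the Fourier step with numpy's real FFT, again on the illustrative stand-in series. Amplitudes are plotted against period in months; for the stand-in data the yearly cycle dominates:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Illustrative stand-in for the detrended monthly series.
N = 12 * (2015 - 1973 + 1)
t = np.arange(N)
co2 = 60 + 40 * np.sin(np.pi * t / N) + 8 * np.sin(2 * np.pi * t / 12)
x = t / N
detrended = co2 - np.polyval(np.polyfit(x, co2, deg=4), x)

# Real FFT: amplitudes per frequency, in cycles/month.
amplitude = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(N, d=1.0)   # d = 1 month between samples
periods = 1.0 / freqs[1:]           # skip the zero-frequency (DC) bin

# Dominant period of the stand-in data: the 12-month cycle.
peak_period = periods[np.argmax(amplitude[1:])]

plt.plot(periods, amplitude[1:])
plt.xlim(0, 24)
plt.xlabel("Period (months)")
plt.ylabel("Amplitude")
plt.savefig("fourier_amplitude.png")
```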

Let me repost the picture for clarity:

Image by author

This is actually very interesting. Let’s say, for brevity, that each peak on the y axis is a recurrent pattern that repeats periodically after exactly M months, where M is the x-axis value of that peak.

Some important periodicities show up:

  1. After around 2.5 months
  2. After 3 months
  3. After 4 months
  4. After half a year (6 months)
  5. After a whole year (12 months)

Again, we can’t be sure, but it does look like the CO2 emissions are related to employees’ vacation time: breaks are usually taken after 6 or 12 months, which are the most prominent peaks. Another hint is given by the 2.5/3/4-month peaks, because not all employees take their break at exactly the same time.

It looks very obvious now, but it is not obvious until you realize it :)

4. Stability

To understand the stability of the time series we can make the following (reasonable) assumption:

The time series is the result of a physical process “disturbed” by some Gaussian white noise.

I am a physicist, and this is our bread and butter. This is how we approach every problem, and there are tons and tons of statistical considerations about the “Gaussianity” of the errors. For now, let’s accept it and move on, exactly like you should do with a breakup :)

A very efficient way to predict the mean value produced by your physical process and to quantify the uncertainty of your dataset is to use the famous Gaussian Process Regressor (GPR). This is another tool that I really like, because it is explainable, sufficiently general, and easy to use. I talk about it in multiple blog posts as well (like here and here). Briefly, a GPR model gives you a mean value and a variance (mean ± 1.96 standard deviations covers about 95% of the probability of finding the points in that band). It may sound confusing, so let’s move slowly.

Let’s consider 60% of the data points as training set.

Let’s fit the GPR model on the training set and apply the fitted model to the whole dataset. By doing this last step, we will have a predicted time series, in terms of a “mean” value and its associated uncertainty. Let me show you:
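The author's GPR code is not shown; scikit-learn's `GaussianProcessRegressor` is the usual choice, but the same idea can be sketched in plain numpy with an RBF kernel. The stand-in data and the kernel hyperparameters below are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=3.0):
    # Squared-exponential (RBF) kernel on 1-D inputs, unit variance.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_fit_predict(x_train, y_train, x_query, noise=0.1):
    # Textbook GP regression equations (Rasmussen & Williams, ch. 2).
    K = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_star = rbf_kernel(x_query, x_train)
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = 1.0 - np.sum(v**2, axis=0)  # prior diagonal is 1 (unit variance)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Stand-in for the standardized, detrended series.
N = 12 * (2015 - 1973 + 1)
t = np.arange(N, dtype=float)
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(N)

# 60% of the points as training set, the rest held out.
train_idx = np.sort(rng.choice(N, size=int(0.6 * N), replace=False))
mean, std = gp_fit_predict(t[train_idx], y[train_idx], t)

# 95% band: mean +/- 1.96 standard deviations.
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```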

It looks kind of messy, but it actually does a very nice job:

Now, in order to see whether it is “stable”, we can investigate how many points fall outside the predicted boundaries.
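Counting the points outside the ±1.96 standard-deviation band, decade by decade, can be sketched as follows; the `y`, `mean`, and `std` arrays below are illustrative stand-ins for the GPR output:

```python
import numpy as np

# Illustrative arrays standing in for the GPR output (y, mean, std).
N = 12 * (2015 - 1973 + 1)
years = 1973 + np.arange(N) / 12
rng = np.random.default_rng(0)
mean = np.sin(2 * np.pi * np.arange(N) / 12)
std = np.full(N, 0.3)
y = mean + 0.3 * rng.standard_normal(N)

# A point is "out of boundaries" if it leaves the 95% band.
outside = np.abs(y - mean) > 1.96 * std

# Count the out-of-band points decade by decade.
counts = {}
for start in (1973, 1983, 1993, 2003):
    decade = (years >= start) & (years < start + 10)
    counts[start] = int(np.sum(outside & decade))
print(counts)
```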

And the result is the following:

And again, let me show you the plot for clarity:

Image by author

This means that:

  1. Only 2 points are out of boundaries from 1973 to 1983
  2. 7 points are out of boundaries from 1983 to 1993
  3. 8 points are out of boundaries from 1993 to 2003
  4. 7 points are out of boundaries from 2003 to 2013

This means that, from 1983 onward, the behavior of the time series has been much more irregular. But there is more! The model is supposed to adapt to changes in behavior, so its uncertainty should grow. Yet even as the uncertainty grows, a lot of points still fall out of boundaries: the instability of the time series keeps increasing. This is actually something that can be seen by looking at the raw time series in the first place.

5. Conclusions

Some takeaways:

  1. It looks like CO2 emissions started to decrease after decades of increase
  2. The CO2 emissions are likely related to employees’ vacation time or, at the very least, to the productivity rhythm of industry
  3. The time series shows increasing instability; a big change in instability was recorded after the first decade of data

If you liked the article and you want to know more about Machine Learning, or you just want to ask me something you can:

A. Follow me on LinkedIn, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any “maximum number of stories for the month” limit, and you can read whatever I (and thousands of other top Machine Learning and Data Science writers) write about the newest technology available.


PhD in Aerospace Engineering at the University of Cincinnati. Machine Learning Engineer @ Gen Nine, Martial Artist, Coffee Drinker, from Italy.