Note: This report is part of a larger project about climate change that can be seen (and hopefully loved) in this GitHub repository.
Some time ago, when I heard on television about the latest Australian bushfire season, the news was really terrible. I was busy with another project at the time, but I put this one on my to-do list, promising myself that I would work on it to form an unbiased, data-driven opinion about climate change.
Let’s get started.
Schedule:
0. The libraries
- Overview
- Stationarity test
- Extreme events study
- Data modelling
- Conclusions
0. The Libraries:
Here’s the collection of libraries that have been used. Additional libraries may be required along the way, but all of them can be found in the notebooks reported in the repository.
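A minimal import block, as a sketch (the exact package list and versions live in the notebooks; the statsmodels and reliability imports for the later chapters are assumptions based on the methods described below):

```python
# Core scientific stack used across the notebooks (a representative
# sketch; exact versions and extra packages live in the repository)
import numpy as np        # numerical arrays
import pandas as pd       # tabular data handling
import scipy.stats        # statistical tests and distributions

# Assumed for later chapters, as described in the text:
# from statsmodels.tsa.stattools import adfuller   # stationarity tests
# from reliability.Fitters import Fit_Weibull_2P   # extreme-event fits
print(np.__version__, pd.__version__)
```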
1. Overview
This is the dataset that has been used. A restricted time range has been analyzed for the following reasons:
- Data from 1850 to 1950 contains a significant number of NaN values
- The study focuses on the recent past to obtain insight about the near future
- Computational constraints: the restricted dataset already contains 75,000+ entries, and the full range would be roughly twice as large.
The dataset that has been used contains the monthly temperature of every city from 1951 to 2013:
The description of the dataset shows that a wide range of temperatures (from -24 °C to +38 °C) is covered. As already mentioned, 75,207 values are available.
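As a sketch, the 1951–2013 restriction can be reproduced as follows. The column names (`dt`, `City`, `AverageTemperature`) are an assumption based on a common city-level temperature file format, and a tiny synthetic frame stands in for the real CSV:

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real dataset (assumed schema:
# 'dt', 'City', 'AverageTemperature'; swap in pd.read_csv(...) for real data)
rng = np.random.default_rng(0)
dates = pd.date_range("1900-01", "2013-12", freq="MS")   # monthly timestamps
df = pd.DataFrame({
    "dt": np.tile(dates, 2),
    "City": np.repeat(["London", "Tokyo"], len(dates)),
    "AverageTemperature": rng.normal(12, 6, 2 * len(dates)),
})

# Restrict to the 1951-2013 window used in the study
df["dt"] = pd.to_datetime(df["dt"])
df = df[(df["dt"] >= "1951-01-01") & (df["dt"] <= "2013-12-31")]
print(df["dt"].min().year, df["dt"].max().year, len(df))
```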
Deeper information can be obtained from the geographical features (Latitude and Longitude). The locations in the dataset are well spread geographically, which permits a complete statistical analysis.
An example of the time series of 4 cities is shown below.
2. Stationarity test
One of the major tools of data analysis is the study of the stationarity of a time series. In this case, stationarity is extremely informative, since the entire question concerns whether the climate is changing. Moreover, many statistical approaches require stationarity. It is thus interesting to test the stationarity of each city’s temperature time series.
In fact, a non-negligible correlation can be found between the year and the average temperature. Let’s go deeper.
A famous test for the stationarity of a time series is the augmented Dickey-Fuller test. It assumes that the time series is described by an autoregressive model plus white Gaussian (delta-correlated) noise. The null hypothesis is that the time series is non-stationary, while the alternative is that it is actually stationary.
The test has been performed in the following way:
The surprising result is that almost all the cities appear stationary: the null hypothesis of non-stationarity is rejected for 99 cities out of 100.
So it might be tempting to conclude that:
" Climate change is a hoax "
But is it?
The first strange effect is that the percentage of stationary cities increases with the length of the time range if we fix 1951 as the first year and perform the augmented Dickey-Fuller test on expanding windows of 5, 10, 15, … years (e.g. first time range: 1951–1956, second time range: 1951–1961, …):
So it is as if the cities fall into a stationarity trap and never leave it… which is actually strange.
Let’s go even deeper. London is a city whose time series has been proven to be stationary for all the time ranges:
But if we consider the year average time series, an explicit trend is visible:
And the augmented Dickey-Fuller test detects it:
This is actually true for 93 cities out of 100: almost every city has a non-stationary time series when the yearly average values are considered.
Even a city like Tokyo, previously classified as stationary, actually shows a visible trend.
So what can we conclude from this analysis?
Results:
- The mathematical point of view. The Dickey-Fuller test is disturbed by the noise in the monthly-average time series, so it fails to extract interesting results there, while it gives useful insights on the yearly averages.
- The qualitative point of view. Climate change is not a hoax: the climate is actually changing, but we perceive the monthly ups and downs while the "change" happens on a different time scale.
3. Extreme events study
The study of extreme events is the branch of statistics that considers the distribution of the extreme points (maxima or minima) of a sample. It is well known that if we repeat the same experiment a large number of times (N approaching infinity), the distribution of the results has a Gaussian shape in its core. Nonetheless, nothing can be concluded about the tails of this distribution.
Why is this important to us?
We may want an idea of the probability that next year’s maximum temperature will be 30°, or 28°, or 10°. Moreover, since we all know that the highest temperatures occur during the summer, this becomes even more interesting: it permits forecasting (or at least getting a probabilistic idea of) the peak we are going to reach on the hottest summer day.
A powerful result is the Fisher-Tippett-Gnedenko theorem, which states (concretely speaking) that the distribution of the maximum converges to one of a few special distributions, which in our case turns out to be the Weibull one. But let’s get there step by step.
As an example, let’s consider the city of Los Angeles.
Estimating the distribution using a KDE plot
Let’s pick a year of values (12 monthly values) and extract the maximum.
Actually, let’s repeat this operation 5000 times to get a distribution of maxima.
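This resampling scheme can be sketched as follows; `scipy.stats` stands in for the reliability package used in the notebooks, and the Gaussian monthly model is a toy assumption:

```python
import numpy as np
from scipy import stats

# Block-maxima sketch: draw a "year" of 12 monthly temperatures, keep the
# maximum, repeat 5000 times, then fit a Weibull to the resulting maxima.
rng = np.random.default_rng(3)
samples = stats.norm(loc=18, scale=6).rvs(size=(5000, 12), random_state=rng)
maxima = samples.max(axis=1)                 # one maximum per simulated year

# Fit a three-parameter Weibull to the annual maxima
shape, loc, scale = stats.weibull_min.fit(maxima)
print(f"Weibull fit: shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}")
```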
Thanks to the (awesome) reliability module, the goodness of the Weibull fit to this extreme-event distribution can be estimated. The module suggests that the fit is a success.
To verify that it actually is, the Kullback-Leibler divergence between the fitted Weibull and the empirical distribution of the maxima has been considered.
Then the following procedure has been used to evaluate that value:
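One plausible sketch of such a check, computing a histogram-based Kullback-Leibler divergence between the empirical maxima and the fitted Weibull (the article’s exact thresholding procedure is not reproduced here; the synthetic maxima are an assumption):

```python
import numpy as np
from scipy import stats

# Regenerate synthetic annual maxima and fit a Weibull to them
rng = np.random.default_rng(5)
maxima = stats.norm(loc=18, scale=6).rvs(size=(5000, 12),
                                         random_state=rng).max(axis=1)
shape, loc, scale = stats.weibull_min.fit(maxima)

# Empirical histogram vs fitted density, evaluated on the same bins
counts, edges = np.histogram(maxima, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
fitted = stats.weibull_min.pdf(centers, shape, loc=loc, scale=scale)

eps = 1e-12                                   # avoid log(0) in empty bins
kl = stats.entropy(counts + eps, fitted + eps)
print(f"KL divergence = {kl:.4f}")            # small value -> good fit
```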
The Weibull is a good fit for Los Angeles extreme events.
The Weibull fit has been repeated for each decade, obtaining Weibull distributions whose peaks shift to the right over time.
The maximum of the following year actually falls in a high-probability region of the Weibull fitted on the preceding decade (forecasting ability).
The same test described before has been applied to all the cities of the dataset.
The cities that pass the Weibull test are stored in a dataset, and the Weibulls are fitted decade by decade (exactly as for Los Angeles).
The maxima of the fitted Weibulls are stored, displayed, and fitted with a straight line.
The signs of the slopes are almost all positive.
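The slope check can be sketched as follows (the per-decade maxima here are synthetic stand-ins for the peaks of the decade-by-decade Weibull fits):

```python
import numpy as np

# Trend-sign check: fit a line through the per-decade maxima and look at
# the sign of the slope. Synthetic numbers with a small upward drift.
rng = np.random.default_rng(11)
decades = np.arange(1951, 2011, 10)               # 1951, 1961, ..., 2001
decade_maxima = 30 + 0.02 * (decades - 1951) + rng.normal(0, 0.1, len(decades))

slope, intercept = np.polyfit(decades, decade_maxima, 1)
print(f"slope = {slope:+.4f} degrees per year")   # positive -> warming
```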
Results:
- The mathematical point of view. The extreme events of a large portion of the cities in the dataset can be described by a Weibull distribution. The Weibull distributions fitted decade by decade shift to the right.
- The qualitative point of view. The maximum of each summer is going to increase decade by decade -> global warming.
- BONUS: Predictive ability. Check it out here.
4. Data modelling
This part is the most challenging, and I’m still deep into it.
The challenge is to model the data with a known distribution (the Gaussian). This chapter highlights the complexity of the topic. Let’s take a look at the distributions of 5 cities:
It’s difficult to fit a Gaussian (or any other known distribution) to them. Nonetheless, as a first approach, let’s consider three methods to check whether a distribution is actually Gaussian:
- Kurtosis test. Tests whether the tails are Gaussian-like
- Skewness test. Tests the symmetry of the distribution
- Kolmogorov-Smirnov test. Tests whether the Gaussian distribution and the empirical distribution are compatible.
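The three checks can be sketched with `scipy.stats` on two synthetic cities: an "equatorial" one with weak seasonality and a bimodal mid-latitude one (both series are illustrative assumptions, not the dataset’s actual cities):

```python
import numpy as np
from scipy import stats

# Synthetic temperatures: equatorial ~ mild noise around one regime;
# mid-latitude ~ two seasonal regimes (winter + summer), hence bimodal
rng = np.random.default_rng(2)
equatorial = rng.normal(26, 1.0, 720)
midlat = np.concatenate([rng.normal(5, 2, 360),    # winters
                         rng.normal(20, 2, 360)])  # summers

for name, x in [("equatorial", equatorial), ("mid-latitude", midlat)]:
    p_kurt = stats.kurtosistest(x).pvalue          # gaussian-like tails?
    p_skew = stats.skewtest(x).pvalue              # symmetric?
    z = (x - x.mean()) / x.std()                   # standardize for KS
    p_ks = stats.kstest(z, "norm").pvalue          # whole-shape comparison
    print(name, round(p_kurt, 3), round(p_skew, 3), round(p_ks, 3))
```

Low p-values mean the Gaussian hypothesis is rejected by that particular test.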
The kurtosis test shows that almost all the cities have a non-Gaussian kurtosis value, except for some equatorial cities.
This is meaningful!
In fact, one of the reasons the distributions are so complicated is the seasons. This is why the equatorial cities are more likely to be Gaussian: their seasonal changes are minimal.
Almost the same conclusion is obtained with the skewness test, although it is less strict and considers a few more cities to be Gaussian.
The union of the two sets of cities has been taken, and the following distributions have been considered Gaussian:
Mhhhh… not that satisfying.
The Kolmogorov-Smirnov test shows that actually no city passes the Gaussian hypothesis.
Actually, isolating the single seasons gives slightly (I mean, really slightly) better results, as Santiago shows a Gaussian behavior in the distribution of its summer temperatures.
Results:
- The mathematical point of view. The Gaussian distribution suits many physical systems when the observation conditions are under control. Complex systems like these are hard to fit with Gaussian distributions.
- The qualitative point of view. Climate change is a really complex topic, and each city should first be considered alone, then integrated into the general scenario as a second step.
- BONUS: Ongoing studies. Check it out here.
Conclusions
For each chapter, some important conclusions have been obtained; they are summarized here:
- The maximum of each summer is going to increase decade by decade. And that is alarming.
- The climate is changing at a rate that is difficult to perceive. That’s why we should trust data and climatologists.
- Climate change is a complex topic, and deep studies are required to have useful information about it.
If you liked the article and you want to know more about Machine Learning, or you just want to ask me something you can:
A. Follow me on LinkedIn, where I publish all my stories B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have. C. Become a referred member, so you won’t have a "maximum number of stories for the month" and you can read whatever I (and thousands of other top Machine Learning and Data Science writers) write about the newest technology available.