
Using Permutation Tests to Prove Climate Change

A simple statistical test shows that the observed increase in average temperatures is very unlikely to be just "bad luck".

Photo by Lacie Slezak on Unsplash

Introduction

We are currently experiencing an increase in the average temperature from one year to the next. But is this increase really statistically significant? The average temperature varies a lot over time, so it might well be that the current situation is just "bad luck" and that we will experience colder years again in the future. Of course, nobody knows exactly how weather and climate will develop, although most scientific models indicate that we have to expect ongoing global warming. But when looking at the past, we can ask ourselves whether the historic weather behaved as we would expect if there were no climate change.

Average Temperature per Year in Germany

The picture above shows the average temperature per year in Germany, and it strongly indicates that there is a significant increase of the average temperature over time, especially during the last 30 years. This article will apply a very simple yet clever statistical test to support this impression.


1. Prerequisites

We will build on the aggregated data from my small article series on "Investigating the Climate Change with Python and Spark":

  1. Getting the data. The first part is about getting publicly available weather data and about extracting relevant metrics from this data. Depending on the data source, this turns out to be more complicated than one might guess.
  2. Preparing the data. The next step will focus on preparing the data in such a way that we can easily answer questions about average measures like temperature or wind speed per weather station and per country.
  3. Generating insights. Finally, in the last part, we will be able to perform some analysis based on the prepared data that will show the climate change.

Specifically, you need the "daily weather per country" aggregate which has been produced in the second installment of the series.

Other than that, we will continue to use PySpark to access the aggregated data, and Pandas plus friends to conduct a simple statistical test.
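The code snippets in this article use the aliases f, np, pd, sns and plt without showing the corresponding imports. As a rough sketch (the earlier parts of the series set these up, together with the Spark session and the daily_country_weather_location variable), they correspond to something like:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pyspark.sql import functions as f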

1.1 Read in Data

First we read in the aggregated data again using PySpark:

# We will be looking at Germany, whose FIPS code is 'GM'
country = "GM"
# Read in data again and filter for selected country
daily_country_weather = \
    spark.read.parquet(daily_country_weather_location) \
    .filter(f.col("CTRY") == country)
# Inspect Schema
daily_country_weather.printSchema()
Data Schema of Daily Weather per Country/State

Although the data contains additional metrics like minimum/maximum temperature and information on wind speed, we focus only on the average temperature for now.

1.2 Yearly Average Temperature

Since we are only interested in the average temperature per year, but the data contains information on a daily basis, we first need to perform a simple aggregation as follows:

# Use PySpark to create aggregates per year
yearly_weather = daily_country_weather \
    .withColumn("year", f.year(f.col("date"))) \
    .groupBy("year").agg(
        f.avg(f.col("avg_temperature")).alias("avg_temperature")
    ) \
    .orderBy(f.col("year")).toPandas()
# Reindex by year
yearly_weather.set_index("year", drop=True, inplace=True)

Before conducting a deeper analysis, let’s create a simple plot using Seaborn, which will also include a regression line.

plt.figure(figsize=(24,6))
sns.regplot(
    x=yearly_weather.index, 
    y=yearly_weather["avg_temperature"], 
    color="r")
Average Temperature per Year in Germany

We can clearly see an increase of the average temperature over time. As a consequence, the graph also keeps reaching new maximum values, and the minimum values tend to be much higher than in the past. We will focus precisely on this effect now.

First we create one more plot with the running maximum from left to right and the running minimum from right to left:

running_max = yearly_weather["avg_temperature"].expanding().max()
running_min = yearly_weather["avg_temperature"] \
                                  [::-1].expanding().min()[::-1]
plt.figure(figsize=(24,6))
sns.relplot(x=yearly_weather.index, 
            y=yearly_weather["avg_temperature"], 
            color="r", aspect=4)
plt.plot(running_max)
plt.plot(running_min)
Average Temperature per Year in Germany with running Maximum/Minimum Values

Now the interesting question is: Do the maximum and minimum values behave as one would expect if there were no change of the underlying temperature distribution? More specifically, we will ask: Assuming the temperature distribution does not change over the years, how many distinct maximum and minimum records would we expect, and how probable is the number of maximum/minimum records that we are actually seeing here?

One interesting aspect of this type of statistical question is that the actual temperature values themselves are not used at all; only the relative ordering of the values is relevant. This property of the test proves very useful in many other cases where you cannot really assign meaningful numeric values to increasingly severe events, but you still want to test for randomness. Of course the result then is of a purely qualitative nature (i.e. the temperature is increasing at a statistically significant level); the test will not tell you by how much the temperature is increasing.

2. A Little Bit of Theory

In order to approach the question whether the number of new maximum/minimum values for the yearly average temperature behaves as one would expect, we need to formalize the problem a little bit.

For the following examination, we generalize the problem to a vector of independent, identically distributed (IID) continuous random variables X(1)…X(n). All three properties are important here: We assume that every random variable X(i) in this vector has the same probability distribution, and we assume that all variables are independent of each other. Finally, we also assume that all variables are continuous, which implies that the probability of two random variables having the same value is zero.

2.1 Record Statistics

We are now interested in what I call a running record: We call position i a record whenever the value of the random variable X(i) is larger than all previous variables X(j) with j < i. With the weather data, we are interested in the number of records of the vector of average temperatures.

Given a sequence of values X(1)…X(n), we can now define a simple statistic as the number of observed records within that sequence, and then we can ask if the value of that statistic looks plausible under the assumption that all X(i) are IID. This assumption corresponds to the null hypothesis of "no climate change" in the temperature series.

Translating the question into a mathematical language, we would like to calculate (or at least estimate) the probability of having k records in a sequence of n IID random variables. This question immediately brings us to random permutations.

2.2 Random Permutations

Instead of answering the original question directly, let’s transform it into an equivalent problem. To do so, we first assign to each random variable the rank of its value, i.e. the smallest random variable is assigned a rank of 1, the second smallest a rank of 2, and so on, until the largest random variable is assigned the rank n. Let’s denote the rank of X(i) by R(i).

The following example gives us the ranks R(i) for the small sequence X(i) with 7 elements.

value X(i):  0.2  2.3  1.1  0.7  2.4  1.8  2.1
rank  R(i):   1    6    3    2    7    4    5
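
As a small aside, pandas can reproduce this rank assignment directly. A minimal sketch using the example values from above:

import pandas as pd

x = pd.Series([0.2, 2.3, 1.1, 0.7, 2.4, 1.8, 2.1])
print(x.rank().astype(int).tolist())  # [1, 6, 3, 2, 7, 4, 5], as in the table above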

Note that the ranks R(i) still have the same relative order as the original values X(i), which in turn implies that the number of records in the ranks vector is the same as the number of records in the original values.

The set of all ranks R(1)…R(n) contains all the numbers from 1 to n, but in a random order. Assuming that the original values X(i) are IID, the vector of ranks R(i) is a _random permutation_. Therefore the probability distribution of the number of records in a sequence of IID random variables is the same as that of the number of records in a random permutation.

2.3 Monte Carlo Sampling

Unfortunately, it turns out that the distribution we are looking for (the number of records in a random permutation) is not exactly simple to calculate. There are closed formulas (see "On the outstanding elements of permutations" by Herbert Wilf), but these build on the unsigned Stirling numbers of the first kind, which are not easy to compute.
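
For reference, the classical combinatorial result behind this is that the number of permutations of length n with exactly k records equals the unsigned Stirling number of the first kind |s(n, k)|, so that

P(exactly k records in a random permutation of length n) = |s(n, k)| / n!

and the expected number of records equals the harmonic sum 1 + 1/2 + … + 1/n, which we will come back to below.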

Instead of trying to evaluate a mathematical formula for the probability distribution, we choose a different route: We simply let the computer conduct a reasonably large number of random experiments by sampling random permutations, and in each experiment we calculate the statistic we are interested in. The histogram (i.e. the relative frequency of each possible outcome) then gives us an estimate of the true probability. This approach is called Monte Carlo sampling, and by the law of large numbers the results converge to the true values as the number of samples grows.

3. Implementation

Now we have gathered all the required theory for implementing the statistical test of whether the number of running maximums/minimums in the temperature chart below is plausible under the null hypothesis that the climate doesn’t change (i.e. the probability distribution of the average temperature doesn’t change over time).

Running Maximum / Reverse Minimum of Average Temperature in Germany

3.1 Monte Carlo Sampling

First we implement two small functions which count the number of running maximum/minimum records for a given series. Note that we also count the first value of each series as the first maximum/minimum; therefore we add 1 to the sum in the last line of each function.

def count_max_records(series):
    # a new maximum record occurs whenever the running maximum strictly increases
    running_max = series.expanding().max()
    records = running_max > running_max.shift(1)
    return records.sum() + 1

def count_min_records(series):
    # a new minimum record occurs whenever the running minimum strictly decreases
    running_min = series.expanding().min()
    records = running_min < running_min.shift(1)
    return records.sum() + 1
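
As a quick sanity check (a small sketch, not part of the original code), applying count_max_records to the seven-value example from section 2.2 yields 3, matching the records at 0.2, 2.3 and 2.4:

example = pd.Series([0.2, 2.3, 1.1, 0.7, 2.4, 1.8, 2.1])
print(count_max_records(example))  # 3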

In the next step, we implement the Monte Carlo sampling for constructing the probability distribution. NumPy already provides a function to generate a random permutation, so we only need to apply one of the functions above to count the number of records. Please note that by symmetry the distribution of the number of maximum records is the same as that of the number of minimum records.

num_samples = 10000
permutation_size = len(yearly_weather.index)
# Setup an array holding the samples
samples = np.zeros(num_samples, dtype=int)
# Conduct random experiments and collect all samples
for i in range(0,num_samples):
    p = np.random.permutation(permutation_size)
    num_maximums = count_max_records(pd.Series(p))
    samples[i] = num_maximums
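
As a side note (an optional tweak, not in the original code), you can seed NumPy's random number generator before the loop so that the sampled distribution is reproducible across runs:

np.random.seed(42)  # optional: make the Monte Carlo samples reproducible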

Using Seaborn we now visualize the collected samples as a histogram.

sns.histplot(data=samples)
Histogram of Sampling Number of Records in Random Permutations

We can clearly see that more than 10 records for a random permutation of length 92 (the number of years in our data) is very unlikely. In fact, the expected number of records in a random permutation of length n is given by the harmonic sum 1 + 1/2 + 1/3 + … + 1/n, which is only about 5.1 for n = 92.
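
This expected value is easy to verify with a one-line computation:

print(sum(1.0 / k for k in range(1, 93)))  # ~5.10, the expected number of records for n=92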

3.2 Empirical Distribution

But of course we’d like to provide a numerical estimate of the probabilities instead of the wording "very unlikely" above. With the help of the venerable statsmodels package, this can easily be done from our Monte Carlo samples by constructing an empirical cumulative distribution function (ECDF) as follows:

from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(samples)

This function will estimate the following probability (as inferred from the samples):

ecdf(k) = probability of observing k or fewer records in a random permutation
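
To get a feeling for how the ECDF behaves, here is a toy sketch that is unrelated to the actual weather samples:

toy = ECDF([1, 2, 2, 3])
print(toy(2))  # 0.75, since three of the four samples are <= 2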

3.3 Probability of Observed Records in Temperature

Finally, we can now use this empirical distribution to estimate the probability of the observed number of maximum/minimum records under the null hypothesis of no climate change. First we use the two helper functions to calculate the maximum/minimum record statistics:

print(count_max_records(yearly_weather["avg_temperature"]))
print(count_min_records(yearly_weather["avg_temperature"][::-1]))

By coincidence, this gives us 12 for both statistics. Now we can use the empirical cumulative distribution function (ECDF) from above to estimate the probability of seeing 12 or more records:

1 - ecdf(11)

This gives us a probability of about 0.00129, i.e. the null hypothesis of no climate change is very unlikely to be true, so we reject it in favor of "Yes, climate change is real".


Conclusion

In this follow-up to the series "Investigating the Climate Change with Python and Spark", we applied a rather simple yet powerful statistical test which allowed us to reject the null hypothesis of "no significant temperature increase" (at least for Germany).

The chosen statistic (the number of running maximum records) might be surprising, since it does not give us any quantitative information on how fast the temperature is increasing, but it provides a very strong indication that the effect is real. This type of test is typically used when the numeric values themselves do not matter. For example, you might apply it to the series of world records in a specific Olympic discipline to study whether the performance of the top athletes is also increasing over time (which probably is the case), which would explain a higher number of new records than one would otherwise expect.

