You are now reading the third and last part of my mini series on analyzing publicly available data to research climate change. As before, the idea is not to become a real expert in meteorology but to apply common sense with appropriate tooling to derive some insights, so that everyone with some experience in programming and in working with data can draw their own conclusions. Democratizing data science, yeah!
As I wrote before, the ability of a broad audience to reproduce scientific insights via commonly available methodologies becomes more and more important in a world where a growing number of decisions is based on data and mathematical models, while at the same time social media makes it ever easier to spread false information. In that spirit, this article series is my take on researching climate change and thereby combating false claims by following a more scientific approach – still without being a climate expert.
Source Code
Many details of processing steps are omitted in this article to keep the focus on the general approach. You can find a Jupyter notebook containing the complete working code on GitHub.
Outline
The whole journey from downloading the data to producing the final plots is split into three separate stories, since each step already contains lots of information.
- Getting the data. The first part is about getting publicly available weather data and about extracting relevant metrics from this data. Depending on the data source, this turns out to be more complicated than one might guess.
- Preparing the data. The next step will focus on preparing the data in such a way that we can easily answer questions about average measures like temperature or wind speed per weather station and per country.
- Generating insights (you are just reading this part). Finally, in this last part, we will perform some analysis on the prepared data that shows the effects of climate change.
1. Retrospective and Final Steps
The first part of this series focused on getting some raw weather data from the National Oceanic and Atmospheric Administration (NOAA) and on transforming the data into a more convenient file format. The second part of the series focused on building an aggregated data set which is well suited for answering basic (but nevertheless relevant) questions about the overall weather for each country on a daily basis.

What is now left is to perform the analysis itself and to draw some conclusions. Specifically, I will try to reproduce some graphs from the real weather experts in my country (the "Deutscher Wetterdienst" in Germany).
So what will we learn this time?
- How to aggregate data with PySpark (once more…)
- How to create meaningful and insightful visualizations of weather data
2. Prerequisites
I assume that you already followed the first part and the second part of this series, since we will be building upon the resulting Parquet files again. The good news is that we won’t generate new data sets this time, which would require more free space on your hard drive or SSD.
As the first step (and to warm up again), let’s read the final data set that we produced in the last article and display its schema:
# Read in data again
daily_country_weather = spark.read.parquet(daily_country_weather_location)

# Inspect schema
daily_country_weather.printSchema()

3. Yearly Average Weather
As I told you, we will rely on "common sense" and see where that gets us. The very obvious idea for demonstrating climate change is to create some yearly plots – and that's not the worst thing to do. But before we go down this route, let me restate the systematic shortcomings that we will have to accept:
3.1 Disclaimer
I am no weather expert, and therefore my methodology isn't fully sound. Specifically, we are working with a huge data set containing weather measurements from weather stations around the world. But most weather stations have only a very limited life span, which means that the metering points change over time, and therefore the weather information isn't strictly comparable between years with different sets of active weather stations. We also don't weight the measurements of weather stations according to their distance to other weather stations (which could be an interesting idea) or apply any other fancy corrections. We follow a plain and simple (but still educated) approach that ignores all these important details.
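To illustrate the distance-weighting idea mentioned above, here is a minimal sketch of one possible variant: weighting each station's measurement by the inverse of its distance to a reference point, so that dense station clusters don't drown out isolated regions. All names and numbers below are made up for illustration; a proper implementation would use real station coordinates from the data set.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points in kilometers
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def weighted_average(stations, ref_lat, ref_lon):
    # Weight each station's temperature by the inverse of its distance
    # to a reference point; clamp the distance to avoid division by zero.
    weights = [1.0 / max(haversine_km(s["lat"], s["lon"], ref_lat, ref_lon), 1.0)
               for s in stations]
    total = sum(weights)
    return sum(w * s["temp"] for w, s in zip(weights, stations)) / total

# Two hypothetical stations symmetric around the reference point
stations = [{"lat": 50.0, "lon": 8.0, "temp": 10.0},
            {"lat": 50.0, "lon": 10.0, "temp": 20.0}]
print(weighted_average(stations, 50.0, 9.0))  # equidistant stations -> plain mean 15.0
```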
3.2 Yearly World Temperature
The simplest thing to do right now is to create a visualization which plots the global average temperature over all countries for all years. We will be using PySpark for the aggregation and Seaborn for the visualization:
# Aggregate average weather over all countries per year
yearly_weather = (daily_country_weather
    .withColumn("year", f.year(f.col("date")))
    .groupBy("year").agg(
        f.avg(f.col("avg_temperature")).alias("avg_temperature"),
        f.avg(f.col("avg_wind_speed")).alias("avg_wind_speed"),
        f.avg(f.col("max_wind_speed")).alias("max_wind_speed")
    )
    .orderBy(f.col("year"))
    .toPandas())
# Create a plot with Seaborn (which was imported as sns)
plt.figure(figsize=(24, 6))
sns.regplot(data=yearly_weather,
    x="year", y="avg_temperature", color="r")

So we see a clear increase of the average global temperature within the data set – but of course common sense tells us that something is wrong here, since an increase of 15°C between 1920 and 1950 would already exceed any worst-case scenario predicted for the future.
So what could be possible explanations for what we are seeing here? Well, if we exclude any programming mistakes on our side and assuming correct implementations in all libraries, the effect has to reside inside the data set itself in combination with our approach. Remember the disclaimer about all the simplifications and the issues with a continuously changing set of weather stations? Let’s try to investigate if this could be a problem. Since we already lost the individual weather stations in the aggregate, let’s simply count the number of countries per year contained in the data set:
# Count the number of distinct countries in the data set per year
yearly_countries = (daily_country_weather
    .withColumn("year", f.year(f.col("date")))
    .groupBy("year").agg(
        f.countDistinct(f.col("CTRY")).alias("num_countries")
    )
    .orderBy(f.col("year"))
    .toPandas())
# Plot the number of countries per year (relplot creates its own figure)
sns.relplot(data=yearly_countries,
    x="year", y="num_countries", color="r", aspect=4)

This plot clearly shows that the data set has a very strong bias until circa 1975: it simply contains far too few countries to be representative of the global weather.
This first investigation shows us that we are rather limited in drawing global conclusions about climate change, since the data set cannot be regarded as representative of the weather of the whole world. Of course, things could be different with a different data set. But hope is not lost: why focus on the whole earth when we could also look at an individual country? This is what we will do in the next steps.
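The distortion caused by a changing station mix can be demonstrated with a tiny made-up example: if a cold region only starts reporting in later years, the plain average drops, even though no single region got any colder.

```python
# Made-up yearly averages: a warm region reports from the start,
# a cold region only joins in 1975 -- exactly the kind of coverage
# change visible in the country counts above.
measurements = [
    (1970, "warm_region", 25.0),
    (1971, "warm_region", 25.0),
    (1975, "warm_region", 25.0),
    (1975, "cold_region", 5.0),
    (1976, "warm_region", 25.0),
    (1976, "cold_region", 5.0),
]

yearly = {}
for year, region, temp in measurements:
    yearly.setdefault(year, []).append(temp)

# The plain average drops by 10 degrees in 1975, although neither
# region changed its temperature at all.
averages = {year: sum(t) / len(t) for year, t in sorted(yearly.items())}
print(averages)  # {1970: 25.0, 1971: 25.0, 1975: 15.0, 1976: 15.0}
```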
4. Yearly Weather in Germany
Having lived in Germany my whole life, I have some first-hand experience with the weather of the last couple of decades. Now the obvious question is: can data support my subjective impression that we indeed have increasingly warmer years with less snow in the winters? Let's query our data set to find some insights.
4.1 Yearly Average Temperature
As the first step, we filter all records in the aggregated data set such that we only look at records belonging to a weather station in Germany. To do so, we first have to look up the FIPS country code of Germany, which is GM. You might want to choose a different country code.
# Choose a FIPS country code for limiting our research to an
# individual country
country = "GM"

# Read in data again and filter for the selected country
daily_country_weather = (spark.read.parquet(daily_country_weather_location)
    .filter(f.col("CTRY") == country))
Now we can use the very same PySpark code to calculate the average temperature per year. This time we will also instruct Seaborn to perform a linear regression and to plot the result into the same image.
yearly_weather = (daily_country_weather
    .withColumn("year", f.year(f.col("date")))
    .groupBy("year").agg(
        f.avg(f.col("avg_temperature")).alias("avg_temperature"),
        f.avg(f.col("avg_wind_speed")).alias("avg_wind_speed"),
        f.avg(f.col("max_wind_speed")).alias("max_wind_speed")
    )
    .orderBy(f.col("year"))
    .toPandas())
# Plot the average temperature per year, this time with a regression
plt.figure(figsize=(24, 6))
sns.regplot(data=yearly_weather,
    x="year", y="avg_temperature", color="r")

That looks much better. As we can see, the data set doesn't contain any information about Germany until circa 1925 – you might have more or less luck with a different country. But although the overall average temperature fluctuates quite a bit, we can see a solid increase, especially during the last 25 years. Note that the average temperature reaches new maximum values, while the old minimum values are not reached any more. This coincides with my personal experience of longer summers and "warm" winters without much snow.
Of course, one could argue that the effect which I am seeing is again due to the data that I am using – I will come back to that completely valid and justified objection later.
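To put a number on the visual trend that Seaborn's regression line shows, one can also compute the least-squares slope directly. The data below is made up for illustration; in the notebook one would feed in the year and avg_temperature columns of yearly_weather instead.

```python
def ols_slope(xs, ys):
    # Ordinary least-squares slope: cov(x, y) / var(x)
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Hypothetical yearly average temperatures with a warming trend
years = [2000, 2005, 2010, 2015, 2020]
temps = [9.0, 9.2, 9.5, 9.9, 10.2]
slope = ols_slope(years, temps)
print(round(slope * 10, 2), "degrees per decade")  # 0.62 degrees per decade
```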
4.2 Yearly Temperature Ranges
Using Seaborn, we can also easily create plots with confidence intervals of the temperatures for all years. This gives us more information than the average alone.
df = (daily_country_weather
    .withColumn("month", f.month(f.col("date")))
    .withColumn("year", f.year(f.col("date")))
    .orderBy(f.col("date"))
    .toPandas())

plt.figure(figsize=(16, 4))
sns.regplot(data=df, x="year", y="avg_temperature",
    x_estimator=np.mean, x_ci="sd", ci=100,
    fit_reg=True, lowess=False,
    line_kws={'color': 'red'})

Again we see a clear increase, especially during the last 20 years.
4.3 Yearly Wind Speed in Germany
Since I also had the impression that the number of storms has increased over the last years (with very visible damage in the forests), I was curious whether I could find any support for this in the data – which is why I already included the wind speed in the PySpark query above.
plt.figure(figsize=(24, 6))
sns.regplot(data=yearly_weather,
    x="year", y="avg_wind_speed", color="b")
sns.regplot(data=yearly_weather,
    x="year", y="max_wind_speed", color="r")

The resulting plot is quite surprising to me: the average wind speed doesn't change much, and the maximum wind speed even decreases. But can we trust the plot? Honestly, I don't think so. I am not a weather expert, but my gut feeling tells me that "average wind speed" is not a good metric in the first place. "Maximum wind speed" might be more interesting, but a maximum aggregation loses all information about the frequency and length of stormy periods. And finally, I am not sure if I interpreted the data correctly – the semantics of wind measurements seem to be more complex than those of air temperature.
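One alternative metric that keeps the frequency information would be to count "storm days", i.e. days on which the maximum wind speed exceeds some threshold. A rough sketch on made-up daily records; the threshold value is my assumption (roughly Beaufort 8, if the data were in m/s), not something taken from the data set:

```python
# Made-up daily records: (year, max_wind_speed)
daily = [
    (2019, 12.0), (2019, 18.5), (2019, 9.0),
    (2020, 17.5), (2020, 21.0), (2020, 8.0),
]

STORM_THRESHOLD = 17.0  # assumed threshold, units depend on the data set

# Count days per year whose maximum wind speed exceeds the threshold
storm_days = {}
for year, max_wind in daily:
    if max_wind > STORM_THRESHOLD:
        storm_days[year] = storm_days.get(year, 0) + 1

print(storm_days)  # {2019: 1, 2020: 2}
```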
So "wind speed" needs some more research to either support or invalidate the plot above. But then again, can we trust the "temperature charts"? This will be investigated in the next section.
5. Monthly Average Weather
How do you know if you can trust your results? A very solid option is to compare them with the results of other people – or even better, with the results of people who actually know what they are doing 🙂
In Germany, the official authority for weather and climate is the "Deutscher Wetterdienst" (DWD), which has its headquarters in Offenbach, only 10 km away from my home. They also provide plots of the development of the average temperature in Germany over time at https://www.dwd.de/DE/leistungen/zeitreihen/zeitreihen.html.

There is one interesting aspect of these plots: they are provided per month instead of per year. The plot above shows how the average temperature in January changed over time – again with a clear increase during the last 20 years. The temperature changes over the years are in fact not evenly distributed among the months, so it makes sense to create a separate plot per month. This is what we will do next, and then we'll compare our results with the official plots from the DWD.
# Aggregate the average temperature per country, state, year and month,
# and convert the result to a Pandas DataFrame for plotting
df = (daily_country_weather
    .groupBy(
        "CTRY",
        "STATE",
        f.year("date").alias("year"),
        f.month("date").alias("month")
    ).agg(
        f.avg(f.col("avg_temperature")).alias("avg_temperature"),
        f.min(f.col("date")).alias("date")
    )
    .orderBy(f.col("year"))
    .toPandas())
And now we plot the results, with a separate plot per month:
for m in range(1, 13):
    data = df[df["month"] == m]
    plt.figure(figsize=(16, 4))
    plt.plot(
        data["year"],
        data["avg_temperature"],
        label="Month " + str(m))
    plt.legend()
    sns.regplot(
        data=data,
        x="year",
        y="avg_temperature")

These plots finally offer us the possibility to compare our results with those of the official experts at the DWD. When you carefully compare the topmost plot with the one from the DWD, you will clearly see strong similarities, although details may vary. Personally, I am more than satisfied with the result, given the simplistic approach.
As I said before, the trend of increasing average temperatures does not affect all months uniformly. But if you pick any of the warmer months from April until October, it becomes very apparent again:

Here you can clearly see that the average temperature is increasing. Of course there have been Septembers in the past with a similar average temperature, but those were followed by rather cold or normal Septembers in later years. Since around 2000 the situation is different: the minimum average temperature is steadily but clearly increasing, without any "cold" September in between.
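This "no cold September any more" observation could be made precise with a trailing-window minimum: the coldest September within the last few years. If that value rises over time, cold years have stopped occurring. A minimal sketch on hypothetical yearly averages:

```python
def trailing_min(values, window=3):
    # Minimum over a trailing window of the given size (shorter at the start)
    return [min(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

# Hypothetical average September temperatures for consecutive years
temps = [12.5, 14.0, 13.0, 14.5, 15.0, 15.5]
print(trailing_min(temps))  # [12.5, 12.5, 12.5, 13.0, 13.0, 14.5]
```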
6. Summary of Achievements
Now we have reached the end of the actual analysis. Let me recapitulate the whole journey from the first part to the last images:
- We downloaded a huge publicly available data set containing weather data.
- We made sense of the raw data by extracting some relevant information from the complex file format.
- We cleaned up the data by carefully replacing suspect values with NULL values.
- We performed multiple aggregations to come up with a much smaller data set which allows us to ask simple questions about some core weather metrics (temperature, wind speed) for different countries and years.
- We finally used this data set to visualize the temperature in Germany which supports the thesis of a temperature increase.
- We compared our results with those of a well-known authority in Germany and concluded that we were able to generate similar insights using common sense and some programming. Furthermore, I assume that the NOAA weather stations are distinct from the ones used by the DWD, which makes the agreement even more valuable.
I am satisfied with the results so far. But as always, things could be improved.
7. Possible Improvements
Although I hope that you agree that we were able to achieve good results, there are still many details which could be improved or where some additional research would be required:
7.1 Metrics for Data Quality
When we first looked at the average temperature for the whole world, we immediately saw that the given data set doesn't contain enough information to draw reliable conclusions. Many countries were underrepresented until around 1980, which renders the data set unusable for truly global investigations.
But what about the data for individual countries like Germany? The graphs looked okay, but we don't know how many weather stations contributed to individual years. We can assume that more weather stations increase the overall data quality. A simple idea would be to count the number of distinct weather stations per day and use this information as a basic quality metric for assessing the reliability of any conclusions.
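A sketch of that quality metric, on made-up records and assuming the pre-aggregation data still carries some station identifier (the name station_id below is hypothetical):

```python
# Made-up daily measurements: (date, station_id)
records = [
    ("1950-01-01", "A"), ("1950-01-01", "A"), ("1950-01-01", "B"),
    ("2000-01-01", "A"), ("2000-01-01", "B"), ("2000-01-01", "C"),
]

# Collect the set of distinct stations reporting on each day
stations_per_day = {}
for date, station_id in records:
    stations_per_day.setdefault(date, set()).add(station_id)

# More distinct stations per day -> more trustworthy daily aggregate
quality = {date: len(ids) for date, ids in stations_per_day.items()}
print(quality)  # {'1950-01-01': 2, '2000-01-01': 3}
```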
7.2 Wind Speed
As I already wrote above, the data contains information on wind speed, but our analysis leaves many questions open – specifically, whether averaging wind speed is a good metric at all. Drawing conclusions about wind speed needs more investigation, but it would also be interesting.
7.3 Precipitation
Rainfall is another very important topic. In Germany, we have been experiencing a couple of years with unusually little precipitation during the summer months. If the situation doesn't change in the coming years, the lack of rainfall will cause many serious problems, since water will become a scarce resource – something which, until last year, I didn't believe would happen in Germany any time soon.
Unfortunately, working with precipitation data is much more complex (you are not interested in the "average" rainfall, but in the total amount per time period), and I couldn't figure out the precise semantics of the data, even after consulting the format documentation.
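For completeness, the aggregation for precipitation would have to use a sum instead of an average. A minimal sketch on made-up records; the semantics and units of the real NOAA precipitation field would still need to be clarified first:

```python
# Made-up daily precipitation records: (year, month, precipitation in mm)
daily = [
    (2020, 6, 2.5), (2020, 6, 0.0), (2020, 6, 12.0),
    (2020, 7, 0.0), (2020, 7, 1.5),
]

# Total (not average!) rainfall per month
monthly_totals = {}
for year, month, rain in daily:
    key = (year, month)
    monthly_totals[key] = monthly_totals.get(key, 0.0) + rain

print(monthly_totals)  # {(2020, 6): 14.5, (2020, 7): 1.5}
```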
Conclusion
Climate Change and specifically Global Warming is real. Period.
Having a somewhat scientific background, I have learnt to trust the scientific discourse, including peer review and incremental improvements in knowledge and understanding. Therefore I never doubted the existence of climate change, since most (if not all) experts in the field are strongly convinced of its existence. Of course, an interesting and difficult question is to what degree the human race is responsible for this change – but as far as I know, the same experts are also convinced that this is the case.
But trusting experts is one thing – being able to retrace their results by oneself is even better. And this is all the more important in highly debated areas like climate change and global warming: a topic which on the one hand will have a huge impact on young and upcoming generations, but which on the other hand is fought over with false information for the sake of short-term profits.
I hope you enjoyed reading this article, and maybe you even learnt something new about working with weather data, PySpark, or where to find massive and meaningful amounts of data.