Data for Change
Climate change is currently a hot topic, with many experts reporting a significant increase in the average temperature across the whole world. Nevertheless, some people don't believe these experts and claim that the climate hasn't changed, while others question the influence of the human species on the current development.
While I am by no means an expert on climate or weather, I was wondering whether I could verify the claim of an increasing average temperature by analyzing appropriate data. Depending on the chosen data source, following this idea can be a technically challenging and insightful journey into weather data. In this article series I want to present my approach of using PySpark for analyzing roughly 100 GB of compressed raw weather data in order to reconstruct some relevant metrics that substantiate climate change.
The purpose of the article series is actually two-fold: diving into working with weather measurements and providing a non-trivial example of using PySpark with real data.
Source Code
Many details of the processing steps are omitted in this article to keep the focus on the general approach. You can find a Jupyter notebook containing the complete working code on GitHub.
Outline
Since the whole journey from downloading the data to performing the final analysis involves many steps, it is split up into three separate articles as follows:
- Getting the data (you are just reading this part). The first part is about getting publicly available weather data and about extracting relevant metrics from this data. Depending on the data source, this turns out to be more complicated than one might guess.
- Preparing the data. The next step will focus on preparing the data in such a way that we can easily answer questions about average measures like temperature or wind speed per weather station and per country.
- Generating insights. Finally, in the last part we will perform some analysis based on the prepared data that will make the climate change visible.
Disclaimer
I am neither an expert in meteorology nor in climate models. I will try to use common sense to gather some insights. After all, this was part of my idea: everyone who has some experience working with data should be able to draw some conclusions about climate change, given a detailed enough data set.
I will add some remarks where I know that things are not correct. As far as I know, a real expert would go down a completely different route and build a weather model from the given measurements, then interpolate the weather evenly across the whole earth. This is completely out of scope for me; instead, we will use simple averages of the measurements per country.
1. What you will learn this time
By reading this first article in the series and even more by following the Jupyter notebook, you will learn several things:
- Where to get plenty of raw weather measurements
- How to use PySpark
- How to perform non-trivial data extraction on real world data
2. Prerequisites
Before we start with the actual work, I’d like to present the prerequisites required to follow the next steps.
2.1 Hardware Requirements
Whenever an article gives you recommendations about hardware, you should be warned. That is no different in this case. We will be working with 100 GB of compressed raw data, which requires appropriate storage capacity. We will derive additional data sets from this raw data, which again need free storage. Plus, we will be using PySpark, which requires additional temporary disk space.
But disk space alone is not enough for processing 100 GB of compressed data; you also need CPU power and plenty of RAM that Spark can use. So I recommend the following hardware:
- At least 300 GB of free storage (100 GB raw data + 50 GB derived data + 150 GB scratch space for Spark).
- At least 16GB of RAM, the more the better.
- All the CPU cores you can get – Spark scales very well with many CPUs.
2.2 Software Requirements
To follow all the steps, you need a Python environment containing PySpark and Jupyter. I use the Anaconda Python distribution for this task, and you will find an appropriate `environment.yml` file in the GitHub repository for creating a conda environment with all required dependencies. Once you have cloned the repository and installed Anaconda (or Miniconda), you can create a self-contained Python environment with the following command
conda env create -n weather -f environment.yml
Then you can activate the environment and start the Jupyter Lab server via
conda activate weather
jupyter-lab --ip=0.0.0.0 --port=8888
3. Getting the Data
In order to perform some analysis on weather data, we first need to get some weather data. When you search the internet, you might find many different sources, which differ in various aspects:
- The amount of available history, i.e. how many years does the data go back in time.
- The geographic coverage, i.e. for which countries and regions information is available in the data set.
- The degree of preprocessing, i.e. does the data reflect the raw measurements of weather stations or is the data already aggregated or maybe even the result of a complex global weather model.
3.1 Choosing a Data Source
A couple of years ago, I found a very valuable source for precisely such data, provided by the National Oceanic and Atmospheric Administration (NOAA): the "Integrated Surface Database" (ISD).
The data set is really impressive because of its detail and history:
- The data goes back to 1901
- It contains hourly weather measurements from thousands of weather stations around the world
- Among the measurements are air temperature, wind speed, precipitation, dew point, atmospheric pressure, cloudiness and much more


Originally I was searching for a non-trivial data set to be used in Spark workshops, but after some time I realized that this data really is a small treasure that can be used for more serious questions concerning weather and climate.
3.2 Downloading the Data
The data can be downloaded via FTP from one of the two following URLs:
- ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
- ftp://ftp.ncei.noaa.gov/pub/data/noaa/
Under each of these URLs you will find multiple subdirectories, one for each year since 1901. You will need all of these yearly subdirectories. In addition you will also need a file called `isd-history.csv`, which contains important meta information on all weather stations. I also strongly suggest downloading the documentation of the file format used for all weather measurements: [isd-format-document.pdf](https://www.ncei.noaa.gov/data/global-hourly/doc/isd-format-document.pdf).
I highly recommend using a dedicated FTP client that provides a download queue and automatic retries. I used the venerable FileZilla for downloading all the files:

Be patient when downloading the data; I think it took almost two days until my computer had fetched all 100 GB of compressed data. (FTP isn't the fastest protocol for many small files, plus the NOAA servers only allow two parallel connections per source IP address.)
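If you prefer scripting over a GUI client, the download can also be automated. The following is only a rough sketch using Python's standard ftplib; the local target directory and the crude resume logic are my own assumptions and not the approach used here, which was plain FileZilla:

import os
from ftplib import FTP

def download_year(year, local_base="/data/weather/raw"):
    """Mirror one yearly subdirectory from the NOAA FTP server to a local directory."""
    ftp = FTP("ftp.ncei.noaa.gov")
    ftp.login()                                 # anonymous login
    ftp.cwd(f"/pub/data/noaa/{year}")
    local_dir = os.path.join(local_base, str(year))
    os.makedirs(local_dir, exist_ok=True)
    for name in ftp.nlst():                     # list all station files of that year
        target = os.path.join(local_dir, name)
        if not os.path.exists(target):          # crude way to resume an aborted download
            with open(target, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
    ftp.quit()

download_year(2013)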
4. Extracting the Data
We will refine the data in multiple processing steps to make it more accessible for our use case. This approach is in line with the concepts of a Data Lake, except that we only use a single data source. The important takeaway here is that we
- … still have access to the original data, so we can always go back and restart all transformations from scratch.
- … first perform mainly a technical format conversion (although in our case, we already apply some business logic to keep things simpler).
- … refine and distill the data specifically for our question. Other questions might require other sorts of processing, which is completely fine with a Data Lake approach (as opposed to a classical Data Warehouse).
As we will see below, the measurement data is stored in a non-standard file format and therefore needs some work to make it easily accessible. This essentially involves classical data extraction patterns on a (relatively) massive amount of data. In order to avoid being limited by the amount of available RAM when processing 100 GB of raw data, we employ PySpark as our workhorse, which provides us with a scalable data processing framework.
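All of the following code assumes a running SparkSession named spark plus a handful of path variables. The snippet below is a minimal sketch of such a setup for a single machine; the memory value, the scratch directory and the concrete paths are only placeholders, so adapt them to your environment (the notebook on GitHub contains its own setup).

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = (
    SparkSession.builder
        .master("local[*]")                            # use all available CPU cores
        .appName("weather")
        .config("spark.driver.memory", "12g")          # give Spark most of the RAM
        .config("spark.local.dir", "/data/spark-tmp")  # scratch space for shuffles
        .getOrCreate()
)

weather_basedir = "/data/weather/raw"             # where the FTP download landed
stations_location = "/data/weather/stations"      # target for the station master data
hourly_weather_location = "/data/weather/hourly"  # target for the extracted measurements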
4.1 Raw Data Organization
The original data set essentially is made up of two different record types:
- Measurements Data. This data is the core of the data set and contains billions of measurements from all weather stations. It is organized into different subdirectories for each year, and each directory contains an individual file for each weather station.
- Master Data. In addition to the measurements, the data set also contains a special file containing generic information about each weather station, for example the country it belongs to, the geographic location and so on. This is what I call master data.
4.2 Converting Master Data
In order to warm up, let's start by inspecting the master data, as it is much easier to work with.
The master data contained in the file `isd-history.csv` is a simple CSV file (as the extension already suggests). So we can easily read the file via PySpark's standard functionality:
weather_stations = spark.read \
    .option("header", True) \
    .csv(weather_basedir + "/isd-history.csv")

weather_stations.limit(10).toPandas()

The master data contains several columns of importance for us:
- The columns `USAF` and `WBAN` contain weather station IDs. Since not all stations have a valid ID in each of these columns, we can only assume the combination of both columns to be unique per weather station (see the small sketch right after this list).
- The column `CTRY` contains the FIPS code of the country the weather station belongs to. Be aware that these country codes are not ISO codes, but FIPS codes: https://en.wikipedia.org/wiki/List_of_FIPS_country_codes
- The columns `BEGIN` and `END` mark the life span of each weather station. Remember, we have data going back to 1901, so most weather stations will not exist for the whole time span.
- The columns `LAT` and `LON` contain the geo coordinates of the weather station. We won't use them.
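Since neither `USAF` nor `WBAN` alone is guaranteed to be valid and unique, a combined identifier is the safest handle on a station. The following snippet is only an illustrative sketch and not part of the notebook; the column name `station_id` is my own choice:

weather_stations.select(
    # Combine both IDs into a single string such as "010010-99999" (hypothetical example)
    f.concat_ws("-", weather_stations["USAF"], weather_stations["WBAN"]).alias("station_id"),
    weather_stations["STATION NAME"],
    weather_stations["CTRY"]
).limit(10).toPandas()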
We see that some column names contain problematic characters like whitespace and braces. Therefore we rename those columns and eventually save the result as Parquet files, which we will then use in later steps.
weather_stations = weather_stations \
    .withColumnRenamed("STATION NAME", "STATION_NAME") \
    .withColumnRenamed("ELEV(M)", "ELEVATION")

weather_stations.write.mode("overwrite").parquet(stations_location)
That was all for the master data. But beware, now come the weather measurements:
4.3 Extracting Measurement Data
Working with the measurements is much more difficult, even on a purely technical level (without interpreting the semantics), since the data is stored in a proprietary ASCII format. Let’s peek inside an arbitrary year:
raw_weather = spark.read.text(weather_basedir + "/2013")
raw_weather.limit(10).toPandas()

That doesn't exactly look like something simple to work with. The details of the format are described in the `isd-format-document.pdf`, which is also available on the FTP servers. When you read the documentation, you'll find out that the format is actually rather complex. Essentially the file is
- Fixed format (i.e. specific fields are stored at specific positions with specific lengths)
- Extended by optional sections
All data stored at fixed locations can easily be extracted (although it always takes some time to get things right), but the dynamic part is really difficult. You will actually find some Java source code on the FTP server to parse the data, but that is obviously not a trivial option in a Python environment. Fortunately the most interesting metrics like air temperature and wind speed are stored at fixed locations (precipitation is not, but we ignore this aspect for today):
- Position 5 to 10: USAF id of weather station
- Position 11 to 15: WBAN id of weather station
- Position 16 to 27: Date and time of measurement
- Position 61 to 63: Wind direction
- Position 64: Wind direction quality flag
- Position 66 to 69: Wind speed
- Position 70: Wind speed quality flag
- Position 88 to 92: Air temperature
- Position 93: Air temperature quality flag
In order to extract these values from each ASCII record, we simply use the PySpark SQL function `substring` and cast the result to an appropriate data type. We also incorporate the required scaling for temperature and wind speed. With this knowledge, we can already extract some important information (some columns are omitted in the code example below):
(
    raw_weather.select(
        f.substring(raw_weather["value"], 5, 6).alias("usaf"),
        f.substring(raw_weather["value"], 11, 5).alias("wban"),
        f.to_timestamp(
            f.substring(raw_weather["value"], 16, 12),
            "yyyyMMddHHmm"
        ).alias("ts"),
        (f.substring(raw_weather["value"], 88, 5).cast("float") / 10.0)
            .alias("air_temperature"),
        f.substring(raw_weather["value"], 93, 1)
            .alias("air_temperature_qual")
    )
    .withColumn("date", f.to_date(f.col("ts")))
    .limit(10)
    .toPandas()
)

You will find some additional code to extract precipitation data in the Jupyter notebook. Finally, a Python function is provided that extracts all the required columns (and some more) from the raw data:
def extract_weather_measurements(raw_weather):
    df = raw_weather.select(
        raw_weather["value"],
        f.substring(raw_weather["value"], 5, 6).alias("usaf"),
        f.substring(raw_weather["value"], 11, 5).alias("wban"),
        # More code is omitted here
    )
    df = extract_precipitation(df, 1, 109)
    df = extract_precipitation(df, 2, 120)
    df = extract_precipitation(df, 3, 131)
    df = extract_precipitation(df, 4, 142)
    return df.drop("value")
Note that in addition to the measurements themselves, we also extract quality indicators for each measurement (not shown in the code above); a small sketch of how they can be used follows after the list below. These indicators are also described in the official documentation and tell us whether each metric of each measurement is valid or not. There are multiple scenarios which result in partially invalid measurements:
- A weather station might not have all sensors. For example a certain weather station might only measure temperature but not wind speed.
- One of the sensors of a weather station might be broken while the other is still working fine.
- Different sensors are collected in different time intervals.
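As a first impression of how these quality flags will be used, the following snippet keeps only records whose air temperature passed the quality checks. This is just a small sketch based on the columns extracted above and assumes the helper functions from the notebook (including extract_precipitation) are defined; the real filtering happens in the next part of the series. According to the ISD documentation, the flag value "1" denotes a value that passed all quality control checks.

# Sketch: extract a single year and keep only trustworthy air temperature readings
weather_2013 = extract_weather_measurements(spark.read.text(weather_basedir + "/2013"))
valid_temperature = weather_2013.where(f.col("air_temperature_qual") == "1")
valid_temperature.limit(10).toPandas()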
4.4 Processing the Full History
So far we have only focused on extracting relevant metrics from a single year, but of course we need to apply the logic to the full history from 1901 until today. This is easily achieved by reading, transforming and writing each year separately in a simple for loop.
We keep the pattern of having a separate directory for each year; this will come in handy for questions about a limited time range (see the short sketch after the code below).
# Set this to `True` if you want to force reprocessing
force = False

for i in range(1901, 2021):
    source_dir = os.path.join(weather_basedir, str(i))
    target_dir = os.path.join(hourly_weather_location, "year=" + str(i))

    if force or not path_exists(target_dir):
        print(f"Processing year {i} to {target_dir}")
        # Read in raw data
        raw_weather = spark.read.text(source_dir)
        # Extract all measurements
        weather = extract_weather_measurements(raw_weather)
        # Repartition (i.e. shuffle) the data.
        # This will ensure that all output files have a similar size
        weather = weather.repartition(32)
        # Write results as Parquet files
        weather.write.mode("overwrite").parquet(target_dir)
    else:
        print(f"Skipping year {i} in {target_dir}")
Note that depending on the beefiness of your machine, this extraction and conversion might take several hours. But keep in mind that we are processing 100 GB of compressed data.
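Because every year is written into a directory named year=<YYYY>, Spark treats year as a partition column when reading the base directory, and filters on that column only touch the matching subdirectories. The following lines are a small sketch (not part of the notebook) of how this helps for questions about a limited time range:

# Sketch: read the extracted measurements back, restricted to the years since 2000.
# Spark prunes all year=<YYYY> directories that do not match the filter.
hourly_weather = spark.read.parquet(hourly_weather_location)
recent_weather = hourly_weather.where(f.col("year") >= 2000)
recent_weather.select("year").distinct().orderBy("year").toPandas()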
Conclusion
In this first part of the series, we mainly achieved two goals: we downloaded all the raw data from the NOAA FTP server, and we extracted some important metrics from the raw files and stored them in Parquet files. So far we didn't really perform any interpretation of the data; all steps were merely a technical format conversion. Such a first step is quite common for transforming incoming data into a format that is well supported by the processing framework in use, which in our case is Apache Spark.
Outlook
So far we haven't investigated the semantics of the data; we merely extracted some metrics without understanding their meaning. In the next part of this series we will move up one level from a purely technical conversion and add semantic processing steps on the Parquet files in order to further simplify working with the data in an analytical context.