Hands-on Tutorials

Democratizing Historical Weather Data Analysis with R

A guide for gardeners, people moving to a different climate, and anyone else interested in analyzing weather data

Sabi Horvat
Towards Data Science
9 min read · Feb 17, 2021

Photo by Egor Myznik on Unsplash

Motivation

As a Data Scientist, I was interested in finding weather data to perform a regression analysis on sales. That is to say, I wanted to understand if weather had a measurable impact on online or in-store sales. I’ve researched similar business questions related to weather in the past ten years, and my hope is that the following tutorial can help others with similar questions.

When I started gardening and choosing the right plants for an area, I found the following maps to provide useful information as a rule of thumb. And now that my partner and I own land around our residence and can invest in our gardening ambitions, I dove deeper into what these statistics mean and where the data comes from. With today’s open data, it is possible to calculate these statistics for a postal code (ZIP code) near you. I find the data analysis fun and more accurate than using a magnifying glass on printed-out maps. To have the data available for answering specific questions can also be helpful.

  • USDA Plant Hardiness Zone Maps: the most common map I’ve seen at garden centers and nurseries in the US; used for selecting plants that will survive the winter.
  • Chill Hours Maps: useful if you want to plant peach trees, cherry trees, and other plants that require a minimum number of chill hours in order to set fruit.
  • US Precipitation Maps: rainfall affects many considerations when planning a landscape, even if you have a consistent water source.
  • Last Spring Freeze: The Farmer’s Almanac and the region’s gardeners and farmers are also good resources for understanding when it might be okay to start planting in the spring.
  • Frost Depth Maps: if you plan to dig in fence posts or irrigation lines, it is helpful to know the expected frost depth on the land.
  • Plant Heat Zone Maps: excessive heat can be a problem for certain plants.

The Data

For the following plots, daily precipitation and temperature data was obtained from the NOAA National Centers for Environmental Information (Climate.gov) via the online search and download option, as a comma-separated values (CSV) file. The data has the following columns:

  • DATE = YYYY-MM-DD
  • STATION = Weather Station ID
  • NAME = Description of the Weather Station
  • PRCP = Precipitation as rainfall (inches)
  • SNOW = Precipitation as snowfall (inches)
  • TMIN = Daily minimum temperature (degrees Fahrenheit)

This data could be analyzed in a spreadsheet, but if you want the process to be repeatable with the click of a button for any postal code, writing a program with R may be a better way.

If you would like to follow along, you may copy my GitHub “weather” repository (towardsDS folder) with a free GitHub account. Or you can download the CSV file directly if you go to this link and Save Page As…

Quick Tips for R and RStudio (Including Installation)

If you already use R, you may skip this section. For new useRs, the process for installation is documented on many sites, but this should point you in the right direction. For data analysis and modeling, I recommend using RStudio with R.

  1. Download and install R from CRAN (the Comprehensive R Archive Network), which you can find at r-project.org. Select the CRAN mirror of your choice for the download.
  2. Download and install RStudio from the RStudio download page. The free desktop version is the best way to start.
  3. Open RStudio and choose File > New File > R Script to create a new file if you’d like to follow along. R Markdown and the other options are also great, but in this tutorial we’ll use a simple R Script.
  4. Alternatively, open an existing R script directly from your file browser. If RStudio is not already running, it will start and automatically set the working directory to the folder that contains that file.
  5. In RStudio, click on Tools > Install Packages… to install the libraries used in the script.
  6. To execute one line of code at a time, position the cursor on that line and press Command+Enter on a Mac or Control+Enter on a PC. If the code extends to another line, using a pipe or until a closed parenthesis, the entire code block is executed and the cursor is moved.
  7. To view or copy the entire R script instead of the snippets below, visit the full R script on GitHub.
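For step 5, the same check can be done from the R console. The following is a minimal sketch, assuming the four packages named in this tutorial (tidyverse, lubridate, ggthemes, ggtext) are the ones you need; it only reports what is missing and leaves the installation itself to you.

```r
# Packages mentioned in this tutorial (assumed to be the complete list).
pkgs <- c("tidyverse", "lubridate", "ggthemes", "ggtext")

# requireNamespace() returns TRUE when a package is installed and loadable.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Anything reported as missing can then be installed with install.packages(missing_pkgs), which does the same job as the Tools > Install Packages… menu.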

Historical Weather Analysis: Rain and Snow

First, import the tidyverse library and use its read_csv() function to import the weather data. The tidyverse is a collection of libraries that includes dplyr, tidyr, ggplot2, and others. To learn more about the tidyverse, please read R for Data Science.

The other libraries in this script provide functions for data wrangling with dates (lubridate) and functions for customizing plots with the grammar of graphics library ggplot2 (ggthemes, ggtext).

Output:

spec_tbl_df [29,342 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ STATION: chr [1:29342] "USW00024229" "USW00024229" "USW00024229" "USW00024229" ...
$ NAME : chr [1:29342] "PORTLAND INTERNATIONAL AIRPORT, OR US" "PORTLAND INTERNATIONAL AIRPORT, OR US" "PORTLAND INTERNATIONAL AIRPORT, OR US" "PORTLAND INTERNATIONAL AIRPORT, OR US" ...
$ DATE : Date[1:29342], format: "1940-10-14" "1940-10-15" ...
$ PRCP : num [1:29342] 0 0 0 0.13 0 0 0.14 0.05 0 0.63 ...
$ SNOW : num [1:29342] 0 0 0 0 0 0 0 0 0 0 ...
$ TMIN : num [1:29342] 53 52 50 58 58 59 54 48 41 53 ...
- attr(*, "spec")=
.. cols(
.. STATION = col_character(),
.. NAME = col_character(),
.. DATE = col_date(format = ""),
.. PRCP = col_double(),
.. SNOW = col_double(),
.. TMIN = col_double()
.. )

The weather data has now been imported into the data frame csv_data. A key point to check is the number of weather stations in the data, as we’ll want to explore only one weather station at a time. The str() function shows the structure of the data frame. A ZIP code search can return several weather stations (at least four for this ZIP code), although the rows shown in the output above all belong to station USW00024229.
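A minimal sketch of the import step follows. So that it runs on its own, it writes the first four rows shown in the str() output above to a temporary file (the third and fourth dates are assumed to be the next consecutive days). The temporary file and the name csv_path are stand-ins for the real download; in practice you would point read_csv() at your own CSV file.

```r
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(lubridate)  # date helpers such as year() and month()

# Stand-in for the downloaded NOAA file: four rows in a temporary CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "STATION,NAME,DATE,PRCP,SNOW,TMIN",
  "USW00024229,\"PORTLAND INTERNATIONAL AIRPORT, OR US\",1940-10-14,0,0,53",
  "USW00024229,\"PORTLAND INTERNATIONAL AIRPORT, OR US\",1940-10-15,0,0,52",
  "USW00024229,\"PORTLAND INTERNATIONAL AIRPORT, OR US\",1940-10-16,0,0,50",
  "USW00024229,\"PORTLAND INTERNATIONAL AIRPORT, OR US\",1940-10-17,0.13,0,58"
), csv_path)

# Declaring the column types up front matches the spec printed by str().
csv_data <- read_csv(
  csv_path,
  col_types = cols(
    STATION = col_character(),
    NAME    = col_character(),
    DATE    = col_date(format = ""),
    PRCP    = col_double(),
    SNOW    = col_double(),
    TMIN    = col_double()
  )
)

str(csv_data)

# One row per station, with the number of daily records for each.
station_counts <- count(csv_data, STATION, NAME)
station_counts
```

Counting rows per station is a more direct check than reading the str() output, because str() only prints the first few values of each column.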

In the next step, we’ll select one weather station for the rest of the analysis by filtering on a station and chaining the other data wrangling tasks together with the pipe operator %>% provided by the tidyverse.

Output:

# A tibble: 1 x 3
NAME min_date max_date
* <chr> <date> <date>
1 PORTLAND INTERNATIONAL AIRPORT, OR US 1940-10-14 2021-02-12

The CSV file has data for this station from 1940-10-14 to 2021-02-12. Since we want to look at full years of data only, in the following code block the data is filtered to the range from 1941 to 2020.
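The station selection and the full-year filter could be sketched as follows. This is an illustration under stated assumptions rather than the original script: the toy csv_data below, including its second station and all of its values, is made up so that the block runs on its own, and the names station_data, date_range, and weather are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Toy stand-in for csv_data; the second station and all values are made up.
csv_data <- tibble(
  STATION = c(rep("USW00024229", 4), "XX000000001"),
  NAME    = c(rep("PORTLAND INTERNATIONAL AIRPORT, OR US", 4), "MADE-UP STATION"),
  DATE    = as.Date(c("1940-12-31", "1941-01-01", "2020-12-31",
                      "2021-01-01", "2020-06-01")),
  PRCP    = c(0.10, 0.00, 0.25, 0.40, 0.00)
)

# Keep a single station for the rest of the analysis.
station_data <- csv_data %>%
  filter(STATION == "USW00024229")

# First and last date available for that station.
date_range <- station_data %>%
  group_by(NAME) %>%
  summarise(min_date = min(DATE), max_date = max(DATE), .groups = "drop")
date_range

# Keep complete calendar years only (1941 through 2020).
weather <- station_data %>%
  filter(year(DATE) >= 1941, year(DATE) <= 2020)
weather
```

Filtering on year(DATE) rather than on hard-coded dates keeps the boundaries easy to change when you experiment with other ranges.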

Now that we have 80 years of data available, we can answer questions such as “Which single day had the most rainfall, and how many inches of rain were measured that day?”

Output:

# A tibble: 2 x 6
STATION NAME DATE PRCP SNOW TMIN
<chr> <chr> <date> <dbl> <dbl> <dbl>
1 USW00024229 PORTLAND INTERNATIONAL... 1943-01-21 1.1 14.4 19
2 USW00024229 PORTLAND INTERNATIONAL... 1996-11-19 2.69 NA 34

The highest single-day rainfall recorded at the Portland, Oregon airport (PDX) weather station was 2.69 inches, measured on 1996-11-19. The highest single-day snowfall was 14.4 inches, measured on 1943-01-21.
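One way to find those record days is to keep every row whose value equals the column maximum. The sketch below assumes a data frame named weather with the columns described earlier. Two of its rows are copied from the output above, and the other three are made-up filler so that the filter has something to discard.

```r
library(tidyverse)

# Two rows from the output above plus three made-up filler rows.
weather <- tibble(
  DATE = as.Date(c("1943-01-20", "1943-01-21", "1996-11-18",
                   "1996-11-19", "2020-07-04")),
  PRCP = c(0.30, 1.10, 0.85, 2.69, 0.00),
  SNOW = c(2.0, 14.4, 0, NA, 0),
  TMIN = c(21, 19, 40, 34, 58)
)

# na.rm = TRUE ignores days with no measurement when computing the maximum.
# A row is kept when it holds the rainfall record or the snowfall record.
record_days <- weather %>%
  filter(PRCP == max(PRCP, na.rm = TRUE) | SNOW == max(SNOW, na.rm = TRUE)) %>%
  arrange(DATE)
record_days
```

The missing SNOW value on the record rainfall day does not cause that row to be dropped, because TRUE | NA evaluates to TRUE in R.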

My favorite way to understand the climate of an area is to view a box-and-whisker plot of monthly rainfall. You might be able to find this data for your area, or at least something similar like the Climate Charts on Wikipedia, but with R you have the ability to zero in on the data by ZIP code or any other granularity that may not have published charts. Additionally, you may want to know the time period of the data included for the graph and experiment with the impact of using different date ranges.

Output:

This image was created by the code block above it

Was using the last 80 years of data appropriate, or would using only the most recent 20 years be more reflective of the climate? For this particular ZIP code, using more or fewer years of data didn’t change the plot significantly, but I chose to display the full 80 years since the extra outliers that show up on the graph are interesting.
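A box-and-whisker plot like this can be sketched in two steps: total the daily rainfall for each year and month, then draw one box per month. The sketch assumes the plot is built from monthly totals, and it uses randomly generated daily values (three years, made up) in place of the station data. The names monthly_rain and rain_plot are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Made-up daily rainfall for three years, so the sketch runs on its own.
set.seed(1)
days <- seq(as.Date("2018-01-01"), as.Date("2020-12-31"), by = "day")
weather <- tibble(DATE = days, PRCP = round(rexp(length(days), rate = 8), 2))

# Step 1: one rainfall total per year and month.
monthly_rain <- weather %>%
  mutate(year = year(DATE), month = month(DATE, label = TRUE)) %>%
  group_by(year, month) %>%
  summarise(monthly_prcp = sum(PRCP, na.rm = TRUE), .groups = "drop")

# Step 2: one box per calendar month, each built from the yearly totals.
rain_plot <- ggplot(monthly_rain, aes(x = month, y = monthly_prcp)) +
  geom_boxplot() +
  labs(
    title = "Monthly rainfall (toy data)",
    x = NULL,
    y = "Rainfall (inches)"
  )
```

Typing rain_plot in the RStudio console draws the chart. To compare date ranges, add a filter() on year(DATE) before the mutate() step and rebuild the plot.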

Now let’s see a plot for snowfall. Since snowfall is less common in this ZIP code, the following plot is based on annual snowfall rather than monthly snowfall.

Output:

This image was created by the code block above it

It is interesting to note that inspecting only the last 20 years of data versus the full 80 years of data would produce a very different result. In the last twenty years, measured snowfall was rare in this area, although the last five years have resumed trends of the previous century.
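The annual view can be sketched by summing SNOW per calendar year. The snowfall values below are made up, and the bar chart is an assumption about the presentation, since the original figure is not reproduced here. The names annual_snow, snow_total, and snow_plot are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Made-up snowfall measurements, including one missing value.
weather <- tibble(
  DATE = as.Date(c("2016-12-14", "2017-01-10", "2017-01-11",
                   "2018-02-20", "2019-02-09")),
  SNOW = c(1.5, 6.5, 1.0, NA, 2.0)
)

# Total snowfall per calendar year; missing measurements count as zero.
annual_snow <- weather %>%
  group_by(year = year(DATE)) %>%
  summarise(snow_total = sum(SNOW, na.rm = TRUE), .groups = "drop")
annual_snow

# One bar per year.
snow_plot <- ggplot(annual_snow, aes(x = year, y = snow_total)) +
  geom_col() +
  labs(title = "Annual snowfall (toy data)", x = NULL, y = "Snowfall (inches)")
```

Restricting the years with filter(year >= 2001) before plotting is a quick way to see how much the picture depends on the date range.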

Historical Weather Analysis: Freezing Temperatures

Gardeners will keep an eye on the weather forecast, and keen gardeners will want to understand historical trends as well. An area of particular interest is the last freeze each spring. If seeds or starts are planted too early, they may succumb to a late frost. On the other end of the growing season, the first freeze each fall is a good indication of when to ensure the harvest is completed for crops that cannot withstand an early frost.

Output:

This image was created by the code block above it

Absorbing this data visually, it seems that the last freeze has been earlier in March in recent years. The local wisdom that it is risky to plant before mid-April seems to be supported by the last 30 years of data. In some recent years, though, the soil has been too saturated with winter and spring rains to allow earlier planting anyway.
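The freeze dates themselves can be derived from TMIN. The sketch below rests on two assumptions that are not taken from the original script: a freeze is any day with TMIN at or below 32 degrees Fahrenheit, and the year is split at July 1 into a spring half and a fall half. The temperatures are made up.

```r
library(tidyverse)
library(lubridate)

# Made-up daily minimum temperatures (degrees Fahrenheit) for two years.
weather <- tibble(
  DATE = as.Date(c("2019-01-05", "2019-03-08", "2019-04-20", "2019-11-02",
                   "2019-12-01", "2020-02-15", "2020-03-15", "2020-10-25",
                   "2020-11-30")),
  TMIN = c(28, 31, 40, 30, 25, 30, 32, 31, 29)
)

freeze_days <- weather %>% filter(TMIN <= 32)

# Latest freeze in January through June of each year.
last_spring <- freeze_days %>%
  filter(month(DATE) <= 6) %>%
  group_by(year = year(DATE)) %>%
  summarise(last_spring_freeze = max(DATE), .groups = "drop")

# Earliest freeze in July through December of each year.
first_fall <- freeze_days %>%
  filter(month(DATE) >= 7) %>%
  group_by(year = year(DATE)) %>%
  summarise(first_fall_freeze = min(DATE), .groups = "drop")

# A full join keeps years that have a freeze in only one half of the year.
freeze_dates <- full_join(last_spring, first_fall, by = "year")
freeze_dates
```

To compare years on a common axis, convert the dates with yday(), which returns the day of the year.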

Historical Weather Analysis: Plant Hardiness Zone

The last analysis that I’ll share concerns cold hardiness for plants. The USDA Plant Hardiness Zone Maps, produced in collaboration with Oregon State University (OSU), show the average annual extreme minimum temperature from 1976–2005. While the amount of work to recreate such a map is extensive, we can recreate a more up-to-date data point for a particular ZIP code.

Most nurseries label each variety of tree, bush, and other perennial plants with the plant hardiness zone to indicate where each variety is likely to pull through the colder months into the next year. The 97218 ZIP code seems to fall into the coloring of region 8b, which means it is safe to plant varieties that are likely to survive an annual extreme low temperature between 15 and 20 degrees Fahrenheit. Let’s compare the data used from 1976–2005 to newer data from 1991–2020 to see if there has been any significant change.

Output (1976–2005):

      year      annual_extreme_low
 Min.   :1976   Min.   : 9.00
 1st Qu.:1983   1st Qu.:14.00
 Median :1990   Median :19.00
 Mean   :1990   Mean   :19.20
 3rd Qu.:1998   3rd Qu.:24.75
 Max.   :2005   Max.   :27.00

Output (1991–2020):

      year      annual_extreme_low
 Min.   :1991   Min.   :11.00
 1st Qu.:1998   1st Qu.:18.00
 Median :2006   Median :22.00
 Mean   :2006   Mean   :20.77
 3rd Qu.:2013   3rd Qu.:25.75
 Max.   :2020   Max.   :27.00

The mean annual extreme low over the newer 30-year period is about 1.6 degrees Fahrenheit warmer (20.8 compared to 19.2). This change, which moves the mean above the cusp of 20, is enough to put this postal code into the 9a hardiness zone. In fact, microclimates within the region are already noted as possibly being in the 9a hardiness zone on the USDA/OSU map, if you look beyond the colors and notice the annotation. This may be due to the heat island effect of the metropolitan area.
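The calculation behind these summaries can be sketched as follows: take the lowest TMIN of each year, then average those annual lows over the period. The helper hardiness_zone() is a hypothetical function written for this illustration. It assumes the standard zone layout in which zone 1a starts at -60 degrees Fahrenheit and each half zone spans 5 degrees, which agrees with the 8b (15 to 20) and 9a (20 to 25) bands mentioned above. The temperatures in the toy data frame are made up.

```r
library(tidyverse)
library(lubridate)

# Made-up daily minimum temperatures (degrees Fahrenheit).
weather <- tibble(
  DATE = as.Date(c("2018-01-02", "2018-12-06", "2019-02-05",
                   "2019-11-29", "2020-01-15", "2020-12-24")),
  TMIN = c(27, 30, 22, 26, 25, 31)
)

# Coldest recorded temperature in each calendar year.
annual_lows <- weather %>%
  group_by(year = year(DATE)) %>%
  summarise(annual_extreme_low = min(TMIN, na.rm = TRUE), .groups = "drop")

summary(annual_lows)

# Hypothetical helper: map a mean annual extreme low to a zone label.
# Assumes zone 1a begins at -60 F and each half zone covers 5 F.
hardiness_zone <- function(mean_extreme_low_f) {
  offset <- mean_extreme_low_f + 60
  zone_number <- floor(offset / 10) + 1
  half <- ifelse(offset %% 10 < 5, "a", "b")
  paste0(zone_number, half)
}

# Zone for the toy data.
toy_zone <- hardiness_zone(mean(annual_lows$annual_extreme_low))

# Zones for the two 30-year means reported above (19.20 and 20.77).
reported_zones <- hardiness_zone(c(19.20, 20.77))
reported_zones
```

Applied to the two reported means, the helper returns 8b for 1976–2005 and 9a for 1991–2020, matching the zones discussed above.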

Learn and Share

I hope you’ve enjoyed this tutorial, and if this was your first time using R, I hope it motivates you to learn more and democratize data analysis!

And if you would like to read (and write) more articles like this, please consider signing up for Medium using my referral link: https://sabolch-horvat.medium.com/membership

[Updated 2021-12-18]: In “The Data”, added information on how to access the CSV data file directly from GitHub.

Supply Chain Data Scientist and Operations Researcher in Oregon 🇺🇸 🇭🇺