NOAA weather data in Snowflake (free)

NOAA GSOD’s daily worldwide weather data is updated daily in Snowflake, and in this post we’ll make it even more useful. Check inside for pivots, geo-joins, finding the closest station to each city, and pattern matching with MATCH_RECOGNIZE().

Felipe Hoffa
Towards Data Science

--

Video on Youtube

The source

To access the daily NOAA GSOD weather data, just go to the Marketplace and create a database with Knoema’s Environment Data Atlas in your account.

Knoema’s Environment Data Atlas in Snowflake

With a couple of exploratory queries you’ll notice that:

  • Having this data automatically refreshed in your account is cool!
  • Making this table useful is not that easy, because:
  • Stations have a lat, lon — but not a way to tell to which city or zip-code they belong.
  • The original NOAA rows for each day have been split into multiple rows for each day — with each row containing only one value for each different measure that a station can produce.
NOAA GSOD transformed by Knoema: One row for each measure

So let’s transform this data here, to make it easier to incorporate in your data pipelines.

Pivot multiple rows into one

This was easy with Snowflake. The following query will take the multiple rows per day per station, and transform them into only one:

--create or replace table noaa_gsod
--cluster by (station_id, date)
--as (
select *
from (
select "Stations", "Date", "Stations Name", "Country", "Indicator Name", "Value"
from KNOEMA_ENVIRONMENT_DATA_ATLAS.ENVIRONMENT.NOAACD2019R
where "Date">='2020-01-01'
and "Measure"='M1'
-- and "Country"='US'
)
pivot(max("Value") for "Indicator Name" in ('Mean visibility (miles)','Maximum temperature (Fahrenheit)','Mean dew point (Fahrenheit)','Maximum wind gust (Number)','Minimum temperature (Fahrenheit)','Maximum sustained wind speed (knots)','Mean wind speed (knots)','Mean station pressure (millibars)','Precipitation amount (inches)','Mean temperature (Fahrenheit)','Mean sea level pressure (millibars)','Snow depth (inches)'))
as p(station_id, date, name, country_fips, visibility, max, dew, wind_max, min, wind_sustained_max, wind_mean, pressure, rain, temp, pressure_sea, snow_depth)
--)
;

Since Snowflake supports native pivots, we were able to merge multiple rows into only one, creating new columns and renaming them for easier usage:

Pivoted NOAA GSOD by Knoema back into a useful wide table

Find cities

Now, how do we find the city closest to each station? Or the zip code?

Let’s start by finding a list of worldwide cities. In this case I used data from Wikidata (we can talk more about Wikidata on Snowflake in a future post).

Geolocate stations

Big cities with NOAA weather stations

With a list of cities and Snowflake’s GIS support we can find the city closest to each station:

create or replace table stations_city
as (
select label city, station, st_distance(st_makepoint(a.lon, a.lat), st_makepoint(b.lon, b.lat)) distance, country_fips, b.country
, station_id
from weather_stations a
join wikimedia.public.cities b
on a.country=b.country_fips
and st_distance(st_makepoint(a.lon, a.lat), st_makepoint(b.lon, b.lat)) < 50000
qualify row_number() over(partition by station order by distance) = 1
order by views desc
)
;

In that SQL query we limited the search of st_distance with a radius of 50km. This geo-join performed pleasantly fast: 3 seconds with a small warehouse.

Note that the qualify row_number() over(partition by station order by distance) = 1 brings us the city closest to each station. To get the station closest to each city we would need to change it to over(partition by city order by distance).

City closest to each station.

Pattern matching: Rainiest cities

Photo by Ewen Roberts

Now let’s have some fun with pattern matching, thanks to our new MATCH_RECOGNIZE powers.

For example let’s find the cities in the US that had the most number of consecutive days with rain:

   select *
from (select * from noaa_gsod where country_fips='US')
match_recognize(
partition by station_id, name
order by date
measures avg(temp) avg_temp, sum(rain) total_rain, min(rain) min_rain, count(*) rainy_days, min(date) date
one row per match
pattern(rain{5,})
define
rain as rain > 0.1
)

I love how easy is to define a pattern between rows with MATCH_RECOGNIZE:

  • First we define rain (for the pattern) as any row that has rain > 0.1 inches. If the rain that day didn’t hit that minimum, we won’t count it as a rainy day.
  • Then we ask Snowflake to look for any series of rows that see that definition of rain for at least 5 consecutive days with pattern(rain{5,}).
  • Check in the query how the combination of one row per match, partition by station_id, name order by date, and the measures min(date) date will help us get the desired results.

If we order the weather stations by most consecutive rainy days in a row, these are the results:

Stations in the US with the most consecutive # of rainy days

As you can see that Quillayute Airport saw 25 days of consecutive rain in January 2020, and then 16 days in December 2020. That’s a lot of non-stop rain.

But where is that airport? What’s the closest city? We can solve that with a join to our previous results:

select *
from (
select *
from (select * from noaa_gsod where country_fips='US')
match_recognize(
partition by station_id, name
order by date
measures avg(temp) avg_temp, sum(rain) total_rain, min(rain) min_rain, count(*) rainy_days, min(date) date
one row per match
pattern(rain{5,})
define
rain as rain > 0.1
)
) a
join (
select *
from stations_city
) b
using (station_id)
order by rainy_days desc, total_rain desc
;

Turns out these are the rainiest cities in the US. Can you guess before you look?

Rainiest cities in the US

Before looking, my guess was Seattle. And it was correct. Seattle followed by Tacoma, its twin city. They truly deserve the title “top rainy cities in the US, by number of consecutive rainy days”.

But where did Quillayute airport go? Turns out it’s not in the 50km radius to one of the major cities I had in my query above. If we grow that radius to 200km, these are the results:

Rainiest cities in the US (weather stations within 200km radius)

And there it is, Seattle and Tacoma again. It sure rains a lot around Washington state.

Everything together in dbt

Data lineage in this project, by dbt

Check out my GitHub project with a full dbt project.

(Thanks to Noel Gomez and Mike Weinberg for their review on the dbt Slack)

Next steps

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn, and check reddit.com/r/snowflake for the most interesting Snowflake news.

--

--

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.