How to Crack Open NetCDF Files in R and Extract Data as Time Series

Access to free climate data can be tricky — here’s how

María Bräuner
Towards Data Science


Frog smashing a NetCDF cube, with a line plot coming out of it. Illustration by María Bräuner. Image by Author.

A lot of things today can be analyzed with climate and weather data in mind: agricultural production, tourism, sales, emergency management, construction, migration, allergies, or your yearly ice cream munchies. But where does one get these data?

Lucky for us, many of those shiny orbiting specks of light we see on a cloudless night (satellites) are doing a lot of the hard work of collecting data for us. Many of the initiatives behind these satellites even make the data free for all of us to use. But some freely available data might not be as easily accessible as you might think.

The first time I bumped into an .nc file was very frustrating. Initially, I was excited to find that the data I needed said “download for free”, but I almost immediately gave up on what I wanted to do because I had no idea how to read this file which had a weird '.nc' ending I had never seen before. An initial Google search proved pointless for me: lots of tutorials with MATLAB and Python — this was at a time when all I knew was some pretty basic R.

Eventually, I learned how to do it and many other things since (some tears were involved in the process), but I fell into the old impostor-brain trap of thinking “maybe I was the only one that didn’t know this”. Writing a tutorial for it never crossed my mind. Recently a friend had a similar issue with an nc file; I sent her my script and some days later she texted me: “Saved. You should make this public!”.

This tutorial is long overdue. So if you’re stuck in a similar way or you are also realizing that NetCDF files are here to stay: I hope this tutorial can be of help!

What is NetCDF?

Satellite data are a lot of data; a lot of multidimensional data (latitude, longitude, time, and one or more variables of interest, like temperature).

Usually, we see analyses depicting the mean temperature by day, week, month, or year, but the original raw data hold a lot more than that. For example, some of the sensors that measure water temperature aboard satellites like Envisat collect data several times a day. These readings are then treated to account for cloudiness and other quality-control aspects to give you one “clean” value per day. Normally you wouldn’t have to worry about this part (if you’re interested, though, data available for download usually comes with additional documentation of how it was cleaned).

But the resulting data is still a lot of data, and a lot of data means heavy files. So satellite data must be packed into small files that can be easily downloaded and accessed by people like you and me. Enter NetCDF.

NetCDF stands for “Network Common Data Form” (although, if you ask me, calling it “common” was a bit of a stretch). It is a standard of the Open Geospatial Consortium, and it allows all of this data to fit in a small file we can easily download. It is true, though, that it is becoming more and more common to store climate, other environmental, and bioinformatics data in NetCDF files. So we might as well get used to using them!

There are programs that make exploring NetCDF files seemingly simpler, but they have limitations: some are Windows-only, require licenses, or need you to install additional software, while others are strictly for visualizing data. Sometimes you need to do different things with these data, maybe combine a time-series dataset of temperature with your own business or research data, and those options cannot help you. Accessing NetCDF files with a language like R gives you a lot more control and customization over what you can do with the available data.

As stated by Unidata, these are the main characteristics of NetCDF files:

  • Machine-independent (whether you use a Windows or a Mac; whether you want to explore nc files with R, Julia, Python, or MATLAB — YOU CAN!)
  • Multi-dimensional array-oriented data (also known as raster data or gridded data)
  • Self-describing (file metadata and attributes are included within the file itself)
  • Portable (accessible by any operating system independent of how they store and read data)

Data Sources

Depending on which part of the world you need data from, obtaining good and reliable datasets can be trickier in some cases than in others. Fortunately, you can download a lot of good data for free from Copernicus, CEDA, or NOAA (and many more sources): air temperature, land temperature; forest cover, vegetation phenology; marine, lake, and river temperature; the list goes on. Most of them use the NetCDF format (files ending in .nc).

For this tutorial, I’ll be using data from GloboLakes. I specifically downloaded the data for Lake Atitlán, which contains daily observations of Lake Surface Water Temperature (LSWT) from 1995 to 2016, derived from a combination of different radiometers on the Envisat and ERS-2 satellites (if you’re interested, there are more details on how they are combined and harmonised at the links provided). I’ll guide you through how to access the data and turn it into a time series in a “normal” data.frame and a .csv file.

Let’s get started: how to open that file

I’m assuming this is not your first time using R and that you have already set your working directory with setwd() to wherever you saved your .nc download. You can access the full R script of this tutorial here.
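If you haven’t installed the packages we’ll use or set the directory yet, a minimal setup sketch could look like this (the path below is just a placeholder):

# one-time installs (skip these if you already have the packages)
> install.packages("ncdf4")
> install.packages("tidyverse")
# point R to the folder that holds your .nc file (placeholder path)
> setwd("/your_path/")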

First, load the packages we’ll need:

# ncdf4: this package lets you manipulate netCDF files. Unlike the ncdf
# package, ncdf4 supports both the NetCDF3 and NetCDF4 formats.
> library(ncdf4)
> library(tidyverse) # because who can live without the tidyverse?

We will use the nc_open() function to open your .nc file. We will store the file in an object called our_nc_data and immediately print it to explore what’s in it (remember, getting to know your data is an essential first step in any analytics process!).

> our_nc_data <- nc_open("/your_path/file.nc")
> print(our_nc_data)

You should get a list of variables and their details, like this:

File GloboLakes/LAKE00001479-GloboLakes-L3S-LSWT-v4.0-fv01.0.nc
     6 variables (excluding dimension variables):
        short lake_surface_water_temperature[lon,lat,time]
            _FillValue: -32768
            units: Kelvin
            scale_factor: 0.0099999
            add_offset: 273.1499938
            long_name: lake surface skin temperature
            valid_min: -200
            valid_max: 5000
            comment: The observations from different instruments have been combined.
            standard_name: lake_surface_water_temperature
        short lswt_uncertainty[lon,lat,time]
            <its metadata, similar to the first variable above>
        byte quality_level[lon,lat,time]
            <its metadata, similar to the first variable above>
        byte obs_instr[lon,lat,time]
            <its metadata, similar to the first variable above>
        byte flag_bias_correction[lon,lat,time]
            <its metadata, similar to the first variable above>
        int lakeid[lon,lat]
            <its metadata, similar to the first variable above>

Our six variables are lake surface water temperature (lswt), lswt uncertainty, quality level, instrument (obs_instr), flag bias correction, and LakeID.

It should also have printed a list of dimensions and a list of global attributes:

     3 dimensions:
        lat  Size: 23
            long_name: latitude
            standard_name: latitude
            units: degrees_north
            valid_min: -90
            valid_max: 90
            axis: Y
            reference-datum: geographical coordinates, WGS84 projection
        lon  Size: 24
            long_name: longitude
            standard_name: longitude
            units: degrees_east
            valid_min: -180
            valid_max: 180
            axis: X
            reference-datum: geographical coordinates, WGS84 projection
        time  Size: 2622 *** is unlimited ***
            long_name: reference time of the lswt file
            standard_name: time
            units: seconds since 1981-01-01 00:00:00
            calendar: gregorian
     49 global attributes:
        <more metadata that you can explore on your own>

Take a minute to read through the information and get to know your data. Notice that, for this GloboLakes file:

  • We have 6 variables (for this tutorial we’ll focus on the first one, the lake_surface_water_temperature[lon,lat,time]; notice the units are in Kelvin).
  • Notice the details of the 3 dimensions (lat = 23, lon = 24, time = 2622; the latitude and longitude are in degrees North and East, and time units are in seconds since 1981-01-01 00:00:00, Gregorian calendar)
  • _FillValue is how a missing value is represented (we will need to change it later to the “NA” that R recognizes as missing values).
  • You can explore the global attributes on your own (basically more metadata)

Since we downloaded daily LSWT data for one lake and we now know more about our data, we can safely say: we have data for 2622 days of averaged LSWT in Kelvin, in slices of 24 by 23 (lon x lat) grid points, for Lake Atitlán. I think it’s easier with a visual:

Visual representation of the 4D data inside an nc file, depicting latitude, longitude, time, and temperature. Image by Author.

Let’s get our coordinate and time dimensions out as their own variables

Now that you know what is in your file, it’s easier to know how to call out each variable and which variables you actually need. To make sure you are calling the right variables, you can use the function attributes() to get the names of the variables and dimensions in the file. Then we will extract latitude, longitude, and time into their own objects with the function ncvar_get(). We will also extract some of their attributes with the function ncatt_get().

> attributes(our_nc_data$var)
> attributes(our_nc_data$dim)
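# The first call should return the six variable names we saw in the
# metadata above, something like (output sketch for this GloboLakes file):
# "lake_surface_water_temperature" "lswt_uncertainty" "quality_level"
# "obs_instr" "flag_bias_correction" "lakeid"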
# Get latitude and longitude with the ncvar_get function and store
# each into its own object:
> lat <- ncvar_get(our_nc_data, "lat")
> nlat <- dim(lat) # to check it matches the metadata: 23
> lon <- ncvar_get(our_nc_data, "lon")
> nlon <- dim(lon) # to check, should be 24

# Check that your lat and lon dimensions match the metadata we explored before:
> print(c(nlon, nlat))

# Get the time variable. Remember: our metadata said the time units are
# seconds since 1981-01-01 00:00:00, so you will not see a recognizable
# date-time format but a big number like "457185600". We will take care
# of this later.
> time <- ncvar_get(our_nc_data, "time")
> head(time) # just to have a look at the numbers
> tunits <- ncatt_get(our_nc_data, "time", "units") # check units
> nt <- dim(time) # should be 2622

Are you still with me?

Now let’s extract the surface temperature data

We keep using the ncvar_get() function to get the variable and ncatt_get() to get attributes.

# get the variable in "matrix slices"
> lswt_array <- ncvar_get(our_nc_data, "lake_surface_water_temperature")

> fillvalue <- ncatt_get(our_nc_data, "lake_surface_water_temperature", "_FillValue")
> dim(lswt_array) # to check; this should give you 24 23 2622

# right away, let's replace the nc fill values with NAs
> lswt_array[lswt_array == fillvalue$value] <- NA
> lswt_array

We’re done with extracting!
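One optional housekeeping step: everything we need now lives in regular R objects, so we can close the file connection with ncdf4’s nc_close():

# we have extracted everything into regular R objects,
# so the connection to the .nc file can be closed
> nc_close(our_nc_data)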

So far so good! But we still have unreadable (at least not human-readable) times. We are also still dealing with a 4D situation, with each variable in its own object; to get a time series, we still need to flatten it all down to a 2D, “normal”, more convenient data frame.

Human-readable time

If we print our tunits it reminds us that our time variable is in “seconds since 1981-01-01 00:00:00”. But if we print time we still see a number like “457185600”. We can change it into a “normal” year-month-day format:

> time_obs <- as.POSIXct(time, origin = "1981-01-01", tz = "GMT")
> length(time_obs) # should be 2622
> range(time_obs)
[1] "1995-06-28 12:00:00 GMT"
[2] "2016-12-31 12:00:00 GMT"
# Looks good

Great! Now instead of a crazy long number, you know your data are within what your metadata said they should be: between 1995 and 2016.

Now let’s have a look at the lswt_array you created earlier

Each “slice” of your array is one point in time across the 24 longitude and 23 latitude grid points, with each lon-lat-time combination holding its own temperature record in Kelvin (printing dim(lswt_array) should return “24 23 2622”).

Visual representation of the data slices inside your .nc file (the lake silhouette, dates, and other specific values apply only if you’re using the same GloboLakes file I am). Note what I mean by “slice”. Image by Author.

If you print lswt_array you’ll see a “slice” like a matrix. In this particular dataset, you’ll also notice there are A LOT of missing values (NAs). Could it be just the first slices that get printed? You can print different individual slices with the following code, where the information within brackets is [lon, lat, slice] and slice is a number between 1 and 2622 (the size of your time variable, in other words, the number of days in your dataset).

# try it out:
> lswt_slice <- lswt_array[ , , 2123]
> lswt_slice <- lswt_array[ , , 25]

# and why not, draw it out:
> image(lon, lat, lswt_slice)

After trying several random slices between 1 and 2622, I noticed there are a lot of missing values around the center, so I double-checked the range of lat and lon in which the lake exists against the range the dataset covers. It turns out that much of our data grid lies over land, not over the lake, so naturally there is no water temperature data for those points.

So I did a quick manual check on Google Maps: the lake sits roughly between 14.61N-14.75N and 91.30W-91.10W. Our data grid extends well beyond this area.
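You can confirm this from R too; a quick sketch using the coordinate objects we extracted earlier:

# compare the grid extent with the lake's rough bounding box
> range(lat) # latitudes covered by the grid
> range(lon) # longitudes covered by the grid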

To make our final file smaller, we will get rid of those extra points. Just not quite yet. We still don’t have a data frame, just 4 separate variables or objects. If we remove the NAs now from the lswt_array, we won’t be able to know which of the remaining values belong to which lon, lat, and time later on.

Let’s build that data frame

First, we create the whole matrix so as not to lose the matching time and temperatures:

# Create a 2D matrix of lon, lat, and time
> lonlattime <- as.matrix(expand.grid(lon, lat, time_obs)) # this might take several seconds

# reshape the whole lswt_array into one long vector
> lswt_vec_long <- as.vector(lswt_array)
> length(lswt_vec_long) # by now it should be 1447344

# Create the data.frame
> lswt_obs <- data.frame(cbind(lonlattime, lswt_vec_long))

Your lswt_obs should look like this now:

> head(lswt_obs)
Var1 Var2 Var3 lswt_vec_long
1 -91.775 14.125 1995-06-28 12:00:00 <NA>
2 -91.725 14.125 1995-06-28 12:00:00 <NA>
3...

That’s okay. But let’s give those columns proper names.

> colnames(lswt_obs) <- c("Long", "Lat", "Date", "LSWT_Kelvin")
> head(lswt_obs)
Long Lat Date LSWT_Kelvin
1 -91.775 14.125 1995-06-28 12:00:00 <NA>
2 -91.725 14.125 1995-06-28 12:00:00 <NA>
3...

Now we’re getting there. In fact, this already has the data you need, but it still carries a lot of unnecessary rows that would make for a huge .csv file we do not need.

Let’s finish cleaning this up to a useful data frame

As I said before, we have a lot of empty spaces (NAs) over land. But if you tried several lswt_slices, you saw that we also have some lake spaces in the grid with missing values (NAs) on different dates. Since our goal today is to get a time series data frame, the way I see it we can go two ways from here (there might be more, let me know in the comments if you would have taken a different approach!):

Option 1: select one specific pair of coordinates [lat,lon] and just keep the temperature data over time for that specific location, delete everything else (or do this with several pairs of coordinates if you wish). With this option, you could have several time series for different coordinates (each with a different amount of NAs that you’d have to decide how to deal with later, depending on what you’ll be doing with the data).
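For reference, a minimal sketch of Option 1 could look like this (the target coordinates below are hypothetical; pick a pair you know is over water):

# Option 1 sketch: keep the time series of a single grid cell
# (the target lon/lat values here are hypothetical)
> i <- which.min(abs(lon - (-91.2))) # index of the closest grid longitude
> j <- which.min(abs(lat - 14.7))    # index of the closest grid latitude
> one_cell <- data.frame(Date = as.Date(time_obs),
                         LSWT_Kelvin = lswt_array[i, j, ])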

Option 2: just delete all NAs (not caring whether you are removing those NAs that are above water) and later take the mean of all the available temperature data by date to have a mean-LSWT per day over the entire available lake surface area. Some dates will have more values than others.

The end result would be the same (a data frame with date and temperature), but your temperature values may vary slightly.

Visual representation of Option 1 and Option 2 described in the text. Image by Author.

For this tutorial, I’ll go with Option 2. Remember: as long as you “come clean” in your final documents about everything you did to the data, so your work can be interpreted appropriately and reproduced, (almost) everything is valid!

# Bold move time: remove all rows with NA in LSWT_Kelvin:
> lswt_final <- na.omit(lswt_obs)

Phew! That’s a lot less data! You can check that out with dim():

> dim(lswt_obs)
1447344 4
> dim(lswt_final)
15308 4

We still have 15308 rows of data but only 2622 distinct dates: several grid points hold temperature values for the same day. (Also, 2622 is not the total number of days between 1995 and 2016, because satellites have a life too, or difficulties.)
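You can verify the number of distinct dates yourself:

# count the distinct dates; it should match the time dimension (2622)
> length(unique(lswt_final$Date))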

By this point of building our time series dataset, we don’t have a use for latitude and longitude anymore, since we will average the temperature by date anyway. Let’s work this all out:

# remove lat & long columns
> lswt_final <- lswt_final[-c(1:2)]
> lswt_final
Date LSWT_Kelvin
<chr> <chr>
1 1995-06-28 12:00:00 296.63999

Well done. But now you can see under the column names (Date and LSWT_Kelvin) there is a <chr> reminding us that they’re both stored as characters or strings (if you don’t see them, you can always double-check your data types with glimpse()). Let’s fix that, because we cannot take the mean() of a string!

> lswt_final$Date <- as.Date(lswt_final$Date)
> lswt_final$LSWT_Kelvin <- as.double(lswt_final$LSWT_Kelvin)

Awesome. Now let’s get that mean daily temperature:

> lswt_final <- lswt_final %>%
group_by(Date) %>%
summarize(Mean_K = mean(LSWT_Kelvin))

> lswt_final
Date Mean_K
<date> <dbl>
1 1995-06-28 297.79
2 1995-07-11 297.07
3 1995-07-17 296.99
4 ...
> dim(lswt_final)
2622 2
# that seems about right!!!

Woohoo! Basically, you’re done, but let’s give it a final touch and change Kelvin into degrees Celsius.

> lswt_C <- lswt_final %>%
mutate(LSWT_C = Mean_K-273.15)

And there you have it! You can now use it or save it in a .csv file or in whichever form you need it.

# save as csv 
> write.csv(as.data.frame(lswt_C), "GloboLakes_Atitlan_TS_95_16.csv", row.names=T)

Here’s a simple line plot of what you have now (keep in mind this data still needs some cleaning):

Line plot: time series of LSWT in degrees Celsius for Lake Atitlán, 1995-2016. Image by Author.
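If you’d like to reproduce something like it, here is a quick ggplot2 sketch (ggplot2 comes loaded with the tidyverse):

# a simple line plot of the daily mean LSWT in Celsius
> ggplot(lswt_C, aes(x = Date, y = LSWT_C)) +
    geom_line(colour = "steelblue") +
    labs(x = NULL, y = "LSWT (°C)", title = "Lake Atitlán LSWT, 1995-2016")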

Now you can go on and do whatever you need with the data!

Access the full script here.

Don’t fear the NetCDF files!

So don’t let a mysterious, unfamiliar file format you’ve never met before scare you away from finishing that project you had in mind just because your language of choice and expertise is R. Go crack open those nc files now!

I tried to make this a condensed tutorial to get you a basic time series for your further cleaning and analysis, but if you’re hooked and would like to know more theory behind nc files and other things you can do, here are the references I used and other recommended further reading:
