Handling NetCDF Files using XArray for Absolute Beginners

Exploring climate-related data manipulation tools in Python

Eden Au
Towards Data Science

--

NetCDF is a machine-independent, array-oriented, multi-dimensional, self-describing, and portable data format used by various scientific communities. It has a filename extension of .nc or .cdf (though it is believed that there are subtle differences between the two). Unlike files in .csv or .xlsx, NetCDF format cannot be accessed and viewed directly using spreadsheet software.

Even if you could, you would not do that on a 4-dimensional data with a bunch of metadata.

I will take climate data from Atmospheric Radiation Measurement Climate Research Facility (ARM) in the United States, and European Centre for Medium-Range Weather Forecasts (ECMWF) in Europe as an example.

Table of Contents

Originally published on my blog edenau.github.io.

Prerequisites

We will use xarray library in Python for data processing. Long story short, it builds upon numpy (and dask) libraries and leverages the power of pandas, but you probably don’t need to know about it. As you might know, package dependency is a pain in Python. That is why the most convenient way to get everything installed is to use the following command:

$ conda install xarray dask netCDF4 bottleneck

Experienced Python programmers are recommended check the relevant documentation for more details. If you are a beginner, no worries. I made a list of dependencies that you need to check:

  • Python 2.7/3.5+ required
  • numpy 1.12+ required
  • pandas 0.19.2+ required
  • scipy for interpolation features
  • bottleneck for speeding up NaN-skipping
  • netCDF4-python for basic netCDF operation such as reading/writing
  • dask-array 0.16+ for parallel computing with dask

If you want to visualize your dataset, you will probably need these:

  • matplotlib 1.5+ for plotting
  • cartopy for maps
  • seaborn for better colour palettes

For absolute beginners, you can check your default version of Python by

$ python --version
Python 2.7.5

You can also check if Python3 is installed by

$ python3 --version
Python 3.4.9

To check the version of packages, use pip freeze or conda list. Things should check out if you install xarray through conda.

Alternatives

iris is an alternative to xarray, but some works need to be done to make it work on Windows, and it does not work well on Mac OS. Iris is also an English word, so googling ‘iris’ gives you many irrelevant results. It was a pain for me to use iris.

Data Preview

It is always a good idea to ‘preview’ and ‘get to know’ your data, its metadata and data structures. Assume you have installed netCDF4-python and the only two commands you need are ncdump and ncview. The former gives text representation of your netCDF dataset (basically metadata and the data itself), while the latter is a very powerful graphical interface for instant data visualization.

ncdump

Go to the directory of your dataset and try

$ ncdump -h twparmbeatmC1.c1.20050101.000000.cdf

As we do not need to see the values of every data entry at the moment, -h ensures only header (metadata) is shown. You will get

netcdf twparmbeatmC1.c1.20050101.000000 {
dimensions:
time = UNLIMITED ; // (8760 currently)
range = 2 ;
p = 37 ;
z = 512 ;
variables:
double base_time ;
base_time:long_name = "Base time in Epoch" ;
base_time:units = "seconds since 1970-1-1 0:00:00 0:00" ;
base_time:string = "2005-01-01 00.00, GMT" ;
base_time:ancillary_variables = "time_offset" ;
float prec_sfc(time) ;
prec_sfc:long_name = "Precipitation Rate" ;
prec_sfc:standard_name = "lwe_precipitation_rate" ;
prec_sfc:units = "mm/hour" ;
prec_sfc:missing_value = -9999.f ;
prec_sfc:_FillValue = -9999.f ;
prec_sfc:source = "twpsmet60sC1.b1" ;
float T_p(time, p) ;
T_p:long_name = "Dry Bulb Temperature, from sounding in p coordinate" ;
T_p:standard_name = "air_temperature" ;
T_p:units = "K" ;
T_p:missing_value = -9999.f ;
T_p:_FillValue = -9999.f ;
T_p:source = "twpsondewnpnC1.b1:tdry" ;
// global attributes:
< OTHER METADATA >
}

You can see dimensions, variables, and other metadata which are quite self-explanatory. Global attributes (not printed above) tells us how the data is collected and pre-processed. In this example, they are measurement data taken at 147.4E 2.1S, Manus, Papua New Guinea by ARM.

When we look into the list of variables: 1-dim prec_sfc and 2-dim T_p, we realize that they have different dimensions(!). Precipitation rate is a scalar measurement at each time, whereas temperature is a column (measurements at different pressure levels instead of altitude levels this time) at every time. It is quite common to see 4-dim data in climate science — latitude, longitude, altitude/pressure level, time.

ncview

Try the following command and it gives you a graphical interface that lists all variables in your dataset, and it is quite straightforward.

$ ncview twparmbeatmC1.c1.20050101.000000.cdf
The graphical interface in Linux using ncview

Terminology

Data structures of xarray

DataArray

xarray.DataArray is an implementation of a labelled, multi-dimensional array for a single variable, such as precipitation, temperature etc.. It has the following key properties:

  • values: a numpy.ndarray holding the array’s values
  • dims: dimension names for each axis (e.g., ('lat', 'lon', 'z', 'time'))
  • coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dim arrays of numbers, DateTime objects, or strings)
  • attrs: an OrderedDict to hold arbitrary metadata (attributes)

DataSet

xarray.DataSet is a collection of DataArrays. Each NetCDF file contains a DataSet.

Coding using XArray

Data Import

You cannot play with the data until you read it. Use open_dataset or open_mfdataset to read a single or multiple NetCDF files, and store it in a DataSet called DS.

import xarray as xr# single file
dataDIR = '../data/ARM/twparmbeatmC1.c1.20050101.000000.cdf'
DS = xr.open_dataset(dataDIR)
# OR multiple files
mfdataDIR = '../data/ARM/twparmbeatmC1.c1.*.000000.cdf'
DS = xr.open_mfdataset(mfdataDIR)

Data Inspection

Remember the 4 key properties of DataArrays? You can useDS.values, DS.var, DS.dims, DS.coords, and DS.attrs for data inspection. This will become very handy in interactive Python. Their functionalities are quite obvious and are left as an exercise to the reader(!).

DataArray Extraction

Extracting DataArrays from DataSet DS is very straightforward, as DS.<var_name> will suffice. You might consider dropping NaN entries by dropna() and selecting data with sel(). The method parameter in sel() allows us to enable the nearest neighbour (inexact) lookups by use of the methods 'pad', 'backfill', or 'nearest'. To specify a range, use slice().

You can transform xr.DataArray to numpy.ndarray by da.values.

# Extract Dry Bulb Temperature in z-coordinate (T_z)
# Select the altitude nearest to 500m above surface
# Drop NaN, convert to Celcius
da = DS.T_z.sel(z=500,method='nearest').dropna(dim='time') - 273.15 # or .ffill(dim='time')
# Select data in 2000s
da = da.sel(time=slice('2000-01-01', '2009-12-31'))
da_numpy = da.values

It is a convention to name DataSet as DS in upper case and DataArray as da in lower case.

DateTime Operation

Assume DataArray da has a dimension timein DateTime format, We can extract the year/month/day/dayofyear/dayofweek by da.time.dt.<year/month/day/...>. Note that the output is still in DataArray.

The following example takes one step further and tries to compute the mean/sum of any variable for each month. We first define a new coordinate system with assign_coords(). Why? Try looking into how year and month performs in DateTime. If we need the system to be aware of the differences between January 2000 and January 2001, we need both year and month to define a new coordinate, which we call it year_month.

We can then group data by groupby('year_month') based on our newly defined coordinate system, followed by mean()or sum() operations.

# Contract the DataArray by taking mean for each Year-Month
def mean_in_year_month(da):
# Index of Year-Month starts at Jan 1991
month_cnt_1991 = (da.time.dt.year.to_index() - 1991) * 12 + da.time.dt.month.to_index()
# Assign newly defined Year-Month to coordinates, then group by it, then take the mean
return da.assign_coords(year_month = month_cnt_1991).groupby('year_month').mean()
da_1 = mean_in_year_month(da1)

DataArray Merging

We can merge multiple DataArrays using xr.merge(). If you attempt to merge two variables with the same name but with different values, xr.MergeError will be raised. This ensures xr.merge()is non-destructive.

DS_new = xr.merge([da_1,da_2,da_3]).dropna(dim='year_month')

Plotting

You can simply take DataArrays as arguments of matplotlib.pyplot methods. For absolute beginners, try plt.plot(), or plt.scatter() for line or scatter plots. Remember to show figures by plt.show(), or save figures by plt.savefig().

If you need maps, cartopy library can generate one easily.

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
da = DS.t_sfc# Draw coastlines of the Earth
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
da.plot()
plt.show()

With a few extra lines of codes, you can generate something like this:

Plots generated by cartopy

Data Export

You can convert DataArray to numpy.ndarray as explained earlier, or convert DataSet or DataArray to pandas.DataFrame as illustrated below.

df = DS.to_dataframe()

You can also export DataArray or DataSet to NetCDF file by

dataDIR = '../data/new.nc'
DS.to_netcdf(dataDIR)

--

--