Handling NetCDF Files using XArray for Absolute Beginners
Exploring climate-related data manipulation tools in Python
--
NetCDF is a machine-independent, array-oriented, multi-dimensional, self-describing, and portable data format used by various scientific communities. It has a filename extension of .nc or .cdf (though it is believed that there are subtle differences between the two). Unlike files in .csv or .xlsx, NetCDF format cannot be accessed and viewed directly using spreadsheet software.
Even if you could, you would not do that on a 4-dimensional data with a bunch of metadata.
I will take climate data from Atmospheric Radiation Measurement Climate Research Facility (ARM) in the United States, and European Centre for Medium-Range Weather Forecasts (ECMWF) in Europe as an example.
Table of Contents
Originally published on my blog edenau.github.io.
Prerequisites
We will use xarray
library in Python for data processing. Long story short, it builds upon numpy
(and dask
) libraries and leverages the power of pandas
, but you probably don’t need to know about it. As you might know, package dependency is a pain in Python. That is why the most convenient way to get everything installed is to use the following command:
$ conda install xarray dask netCDF4 bottleneck
Experienced Python programmers are recommended check the relevant documentation for more details. If you are a beginner, no worries. I made a list of dependencies that you need to check:
- Python 2.7/3.5+ required
- numpy 1.12+ required
- pandas 0.19.2+ required
- scipy for interpolation features
- bottleneck for speeding up NaN-skipping
- netCDF4-python for basic netCDF operation such as reading/writing
- dask-array 0.16+ for parallel computing with dask
If you want to visualize your dataset, you will probably need these:
- matplotlib 1.5+ for plotting
- cartopy for maps
- seaborn for better colour palettes
For absolute beginners, you can check your default version of Python by
$ python --version
Python 2.7.5
You can also check if Python3 is installed by
$ python3 --version
Python 3.4.9
To check the version of packages, use pip freeze
or conda list
. Things should check out if you install xarray
through conda
.
Alternatives
iris
is an alternative to xarray
, but some works need to be done to make it work on Windows, and it does not work well on Mac OS. Iris is also an English word, so googling ‘iris’ gives you many irrelevant results. It was a pain for me to use iris
.
Data Preview
It is always a good idea to ‘preview’ and ‘get to know’ your data, its metadata and data structures. Assume you have installed netCDF4-python
and the only two commands you need are ncdump
and ncview
. The former gives text representation of your netCDF dataset (basically metadata and the data itself), while the latter is a very powerful graphical interface for instant data visualization.
ncdump
Go to the directory of your dataset and try
$ ncdump -h twparmbeatmC1.c1.20050101.000000.cdf
As we do not need to see the values of every data entry at the moment, -h
ensures only header (metadata) is shown. You will get
netcdf twparmbeatmC1.c1.20050101.000000 {
dimensions:
time = UNLIMITED ; // (8760 currently)
range = 2 ;
p = 37 ;
z = 512 ;
variables:
double base_time ;
base_time:long_name = "Base time in Epoch" ;
base_time:units = "seconds since 1970-1-1 0:00:00 0:00" ;
base_time:string = "2005-01-01 00.00, GMT" ;
base_time:ancillary_variables = "time_offset" ;
float prec_sfc(time) ;
prec_sfc:long_name = "Precipitation Rate" ;
prec_sfc:standard_name = "lwe_precipitation_rate" ;
prec_sfc:units = "mm/hour" ;
prec_sfc:missing_value = -9999.f ;
prec_sfc:_FillValue = -9999.f ;
prec_sfc:source = "twpsmet60sC1.b1" ;
float T_p(time, p) ;
T_p:long_name = "Dry Bulb Temperature, from sounding in p coordinate" ;
T_p:standard_name = "air_temperature" ;
T_p:units = "K" ;
T_p:missing_value = -9999.f ;
T_p:_FillValue = -9999.f ;
T_p:source = "twpsondewnpnC1.b1:tdry" ;// global attributes:
< OTHER METADATA >
}
You can see dimensions, variables, and other metadata which are quite self-explanatory. Global attributes (not printed above) tells us how the data is collected and pre-processed. In this example, they are measurement data taken at 147.4E 2.1S, Manus, Papua New Guinea by ARM.
When we look into the list of variables: 1-dim prec_sfc
and 2-dim T_p
, we realize that they have different dimensions(!). Precipitation rate is a scalar measurement at each time, whereas temperature is a column (measurements at different pressure levels instead of altitude levels this time) at every time. It is quite common to see 4-dim data in climate science — latitude, longitude, altitude/pressure level, time.
ncview
Try the following command and it gives you a graphical interface that lists all variables in your dataset, and it is quite straightforward.
$ ncview twparmbeatmC1.c1.20050101.000000.cdf
Terminology
DataArray
xarray.DataArray
is an implementation of a labelled, multi-dimensional array for a single variable, such as precipitation, temperature etc.. It has the following key properties:
values
: anumpy.ndarray
holding the array’s valuesdims
: dimension names for each axis (e.g.,('lat', 'lon', 'z', 'time')
)coords
: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dim arrays of numbers, DateTime objects, or strings)attrs
: anOrderedDict
to hold arbitrary metadata (attributes)
DataSet
xarray.DataSet
is a collection of DataArrays. Each NetCDF file contains a DataSet.
Coding using XArray
Data Import
You cannot play with the data until you read it. Use open_dataset
or open_mfdataset
to read a single or multiple NetCDF files, and store it in a DataSet called DS
.
import xarray as xr# single file
dataDIR = '../data/ARM/twparmbeatmC1.c1.20050101.000000.cdf'
DS = xr.open_dataset(dataDIR)# OR multiple files
mfdataDIR = '../data/ARM/twparmbeatmC1.c1.*.000000.cdf'
DS = xr.open_mfdataset(mfdataDIR)
Data Inspection
Remember the 4 key properties of DataArrays? You can useDS.values
, DS.var
, DS.dims
, DS.coords
, and DS.attrs
for data inspection. This will become very handy in interactive Python. Their functionalities are quite obvious and are left as an exercise to the reader(!).
DataArray Extraction
Extracting DataArrays from DataSet DS
is very straightforward, as DS.<var_name>
will suffice. You might consider dropping NaN entries by dropna()
and selecting data with sel()
. The method
parameter in sel()
allows us to enable the nearest neighbour (inexact) lookups by use of the methods 'pad'
, 'backfill'
, or 'nearest'
. To specify a range, use slice()
.
You can transform xr.DataArray
to numpy.ndarray
by da.values
.
# Extract Dry Bulb Temperature in z-coordinate (T_z)
# Select the altitude nearest to 500m above surface
# Drop NaN, convert to Celcius
da = DS.T_z.sel(z=500,method='nearest').dropna(dim='time') - 273.15 # or .ffill(dim='time')# Select data in 2000s
da = da.sel(time=slice('2000-01-01', '2009-12-31'))
da_numpy = da.values
It is a convention to name DataSet as DS
in upper case and DataArray as da
in lower case.
DateTime Operation
Assume DataArray da
has a dimension time
in DateTime format, We can extract the year/month/day/dayofyear/dayofweek by da.time.dt.<year/month/day/...>
. Note that the output is still in DataArray.
The following example takes one step further and tries to compute the mean/sum of any variable for each month. We first define a new coordinate system with assign_coords()
. Why? Try looking into how year and month performs in DateTime. If we need the system to be aware of the differences between January 2000 and January 2001, we need both year and month to define a new coordinate, which we call it year_month
.
We can then group data by groupby('year_month')
based on our newly defined coordinate system, followed by mean()
or sum()
operations.
# Contract the DataArray by taking mean for each Year-Month
def mean_in_year_month(da):
# Index of Year-Month starts at Jan 1991
month_cnt_1991 = (da.time.dt.year.to_index() - 1991) * 12 + da.time.dt.month.to_index()
# Assign newly defined Year-Month to coordinates, then group by it, then take the mean
return da.assign_coords(year_month = month_cnt_1991).groupby('year_month').mean()da_1 = mean_in_year_month(da1)
DataArray Merging
We can merge multiple DataArrays using xr.merge()
. If you attempt to merge two variables with the same name but with different values, xr.MergeError
will be raised. This ensures xr.merge()
is non-destructive.
DS_new = xr.merge([da_1,da_2,da_3]).dropna(dim='year_month')
Plotting
You can simply take DataArrays as arguments of matplotlib.pyplot
methods. For absolute beginners, try plt.plot()
, or plt.scatter()
for line or scatter plots. Remember to show figures by plt.show()
, or save figures by plt.savefig()
.
If you need maps, cartopy
library can generate one easily.
import cartopy.crs as ccrs
import matplotlib.pyplot as pltda = DS.t_sfc# Draw coastlines of the Earth
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
da.plot()
plt.show()
With a few extra lines of codes, you can generate something like this:
Data Export
You can convert DataArray to numpy.ndarray
as explained earlier, or convert DataSet or DataArray to pandas.DataFrame
as illustrated below.
df = DS.to_dataframe()
You can also export DataArray or DataSet to NetCDF file by
dataDIR = '../data/new.nc'
DS.to_netcdf(dataDIR)
Remarks
If you are interested in Python or programming in general, the following articles might be helpful:
This tutorial was written for teaching purpose and only included the very basics of xarray
. Hope this helps!
Originally published on my blog edenau.github.io.