Basic data structures of xarray

Udi Yosovzon
Towards Data Science
Sep 5, 2019

Xarray is a Python package for working with labeled multi-dimensional (a.k.a. N-dimensional, ND) arrays. It includes functions for advanced analytics and visualization. Xarray is heavily inspired by pandas and uses pandas internally. While pandas is a great tool for working with tabular data, it can get a little awkward when the data has more than two dimensions. Pandas’ main data structures are Series (for 1-dimensional data) and DataFrame (for 2-dimensional data). It used to have Panel (for 3-dimensional data), but Panel was removed in version 0.25.0.

The reader is assumed to be familiar with pandas. If you do not know what pandas is, you should check it out before xarray.

Why should I use ND arrays?

In many fields of science, there’s a need to map a data point to various properties (referred to as coordinates). For example, you may want to map a temperature measurement to latitude, longitude, altitude and time. That is 4 dimensions!

In Python, the fundamental package for working with ND arrays is NumPy. Xarray has some built-in features that make working with ND arrays easier than with plain NumPy:

  • Instead of axis numbers, xarray uses named dimensions, which makes it easy to select data and apply operations over dimensions.
  • A NumPy array holds a single data type, while an xarray Dataset can hold several variables of different types over shared dimensions. Xarray also makes NaN handling easier.
  • You can keep track of arbitrary metadata on your object with obj.attrs.
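To make these points concrete, here is a small sketch that compares a NumPy reduction by axis number with the same reduction by dimension name. The array values, the coordinate labels and the "units" attribute are made up for illustration:

```python
import numpy as np
import xarray as xr

# Illustrative values only: 3 latitudes x 2 longitudes, one missing value
values = np.array([[10.0, 12.0], [14.0, np.nan], [18.0, 20.0]])

# NumPy: we have to remember that axis 0 means "lat"
mean_np = np.nanmean(values, axis=0)

da = xr.DataArray(
    values,
    dims=["lat", "lon"],
    coords={"lat": [10, 20, 30], "lon": [100, 110]},
    attrs={"units": "degC"},  # arbitrary metadata travels with the object
)

# xarray: reduce over a dimension by name; NaN is skipped by default for floats
mean_xr = da.mean(dim="lat")

# select a single point by its coordinate labels
point = da.sel(lat=20, lon=100)
```

Both means hold the same numbers, but the xarray version states which dimension is being averaged and keeps the lon labels on the result.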

Data structures

Xarray has two data structures:

  • DataArray — for a single data variable
  • Dataset — a container for multiple DataArrays (data variables)

There’s a distinction between data variables and coordinates, according to CF conventions. Xarray follows these conventions, but the distinction is mostly semantic and you don’t have to follow it. I see it like this: a data variable is the data of interest, and a coordinate is a label that describes the data of interest. For example, latitude, longitude and time are coordinates, while temperature is a data variable. This is because we are interested in measuring temperatures; everything else describes the measurement (the data point). The xarray docs say:

Coordinates indicate constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in data.

Okay, let’s see some code!

# customary imports
import numpy as np
import pandas as pd
import xarray as xr

First, we’ll create some toy temperature data to play with:
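A minimal sketch of such toy data, assuming four scattered measurement points. The seed, the value ranges and the rounding are illustrative choices, not prescribed values:

```python
import numpy as np

# Assumption: four random measurement points; the seed only makes the run repeatable
rng = np.random.default_rng(123)

# temperatures scattered around 15 degrees, rounded for readability
temperature = np.round(15 + 10 * rng.standard_normal(4), 2)

# one latitude and one longitude per measurement
lat = np.round(rng.uniform(-90, 90, size=4), 2)
lon = np.round(rng.uniform(-180, 180, size=4), 2)
```

All three arrays are 1-dimensional and have the same length, so each position describes one measurement.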

We generated an array of random temperature values, along with arrays for the coordinates latitude and longitude (2 dimensions). First, let’s see how we can represent this data in pandas:

df = pd.DataFrame({"temperature":temperature, "lat":lat, "lon":lon})
df

We will create a DataArray from this data. Let’s have a look at four ways to do that:

  • from a pandas Series
  • from a pandas DataFrame
  • using the DataArray constructor
  • using the DataArray constructor with projected coordinates

Creating DataArray from Series

We’ll create a pandas Series and then a DataArray. Since we want to represent 2 dimensions in our data, we will create a Series with a 2-level multi-index:

idx = pd.MultiIndex.from_arrays(arrays=[lat,lon], names=["lat","lon"])
s = pd.Series(data=temperature, index=idx)
s
# use from_series method
da = xr.DataArray.from_series(s)
da

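Printing da shows the values together with the lat and lon dimensions and their coordinate labels. Because from_series unstacks the multi-index, any (lat, lon) combination that is missing from the Series shows up as NaN.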

Creating a DataArray from DataFrame

We can provide the DataArray constructor with a pandas DataFrame. It treats the index of the data frame as the first dimension and the columns as the second. If the index or the columns have multiple levels, xarray keeps them as a multi-index on that dimension instead of splitting them into separate dimensions.

Since we want latitude and longitude as our dimensions, the smartest way to achieve this would be to pivot our data frame using latitude as index and longitude as columns:

df_pv = df.pivot(index="lat", columns="lon")
# drop first level of columns as it's not necessary
df_pv = df_pv.droplevel(0, axis=1)
df_pv

With our data pivoted, we can easily create a DataArray by providing the DataArray constructor with our pivoted data frame:

da = xr.DataArray(data=df_pv)
da

Creating a DataArray using the constructor

So we’ve seen two ways to create a DataArray from pandas objects. Now let’s see how we can create a DataArray manually. Since we want to represent 2 dimensions in our data, the data should be shaped in a 2-dimensional array so we can pass it directly to the DataArray constructor.

We’ll use the data from the pivoted data frame, then we’ll need to specify the coordinates and dimensions explicitly:
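A minimal sketch of that constructor call. To stay self-contained it builds a small stand-in for the pivoted frame by hand (the lat, lon and temperature numbers are made up); with the real frame, df_pv.values, df_pv.index and df_pv.columns would presumably play the same roles:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the pivoted frame: 4 points on a 4x4 lat/lon grid,
# NaN wherever a (lat, lon) pair has no measurement
lat = np.array([-60.5, -10.25, 20.0, 45.75])
lon = np.array([-120.0, -30.5, 15.25, 90.0])
data = np.full((4, 4), np.nan)
data[np.arange(4), [2, 0, 3, 1]] = [11.3, 24.9, 17.8, 2.1]

da = xr.DataArray(
    data=data,
    coords={"lat": lat, "lon": lon},  # one 1-D array per dimension
    dims=["lat", "lon"],              # order matches the axes of `data`
)
```

The order of dims has to follow the axes of the data array: here the rows are latitudes and the columns are longitudes.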

The important thing to notice here is that coordinate arrays must be 1-dimensional and have the length of the dimension they represent. We had a (4,4) shaped array of data, so we supplied the constructor with two coordinate arrays. Each one is 1-dimensional and has a length of 4.
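As a quick check of the length rule, the sketch below passes a latitude coordinate with 3 labels for an axis of length 4. The values are made up, and the exact error text may differ between xarray versions, so only the exception type is relied on:

```python
import numpy as np
import xarray as xr

data = np.zeros((4, 4))
lon = [0.0, 10.0, 20.0, 30.0]

try:
    # "lat" has 3 labels but the lat axis of `data` has length 4
    xr.DataArray(
        data,
        coords={"lat": [0.0, 10.0, 20.0], "lon": lon},
        dims=["lat", "lon"],
    )
except ValueError as err:
    error_message = str(err)
else:
    error_message = None
```

Here error_message ends up holding xarray's complaint about the mismatched size, which is a handy early warning when the data and its labels get out of sync.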

Creating a DataArray using the constructor with projected coordinates

We’ll check out one final way to create a DataArray: with projected coordinates. They can be useful in some cases, but they have one disadvantage: the dimensions themselves (x and y below) carry no labels, so they are harder to interpret. The big advantage is that we can pass arrays of the same shape to the DataArray constructor for both data and coordinates, without having to pivot the data first.

In our case, we have temperature data, and we have two dimensions: latitude and longitude, so we can represent our data in a 2-dimensional array of any shape (it doesn’t have to be pivoted) and then provide the constructor with 2 coordinate arrays of the same shape for latitude and longitude:
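One way to get arrays of matching shape is to arrange the four measurements on an arbitrary 2x2 grid. This is only a sketch: the (2, 2) shape, the seed and the value ranges are assumptions made for illustration:

```python
import numpy as np

# Assumption: four measurements laid out on an arbitrary 2x2 (x, y) grid
rng = np.random.default_rng(123)

temperature = np.round(15 + 10 * rng.standard_normal(4), 2).reshape(2, 2)
lat = np.round(rng.uniform(-90, 90, size=4), 2).reshape(2, 2)
lon = np.round(rng.uniform(-180, 180, size=4), 2).reshape(2, 2)
```

Position [i, j] in all three arrays refers to the same measurement, which is what lets the coordinates share the data’s dimensions.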

Below, notice how the coordinates are specified in the DataArray constructor. coords is a dictionary whose keys are the coordinate names and whose values are tuples: the first item is a list of dimensions and the second item is the coordinate values.

da = xr.DataArray(data=temperature,
                  coords={"lat": (["x","y"], lat),
                          "lon": (["x","y"], lon)},
                  dims=["x","y"])
da

Notice that it says x and y are dimensions without coordinates. Notice, as well, that there’s no asterisk next to lat and lon because they are non-dimension coordinates.
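A short sketch of what this means in practice, using made-up 2x2 values. Since x and y have no labels, positional selection with isel is the natural choice, while lat and lon still travel along with whatever is selected:

```python
import numpy as np
import xarray as xr

# Made-up values for illustration
temperature = np.array([[4.1, 25.0], [17.8, -0.1]])
lat = np.array([[-60.5, -10.25], [20.0, 45.75]])
lon = np.array([[-120.0, -30.5], [15.25, 90.0]])

da = xr.DataArray(
    data=temperature,
    coords={"lat": (["x", "y"], lat), "lon": (["x", "y"], lon)},
    dims=["x", "y"],
)

# x and y carry no labels, so we select by position
point = da.isel(x=1, y=0)

# lat/lon come along with the selection as scalar coordinates
point_lat = float(point["lat"])

# filtering by latitude uses a mask, because lat is not an index here
north = da.where(da["lat"] > 0)
```

The masked result keeps the original shape and holds NaN wherever the condition is false.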

3 dimensions

Now let’s create another dimension! Let’s create temperature data for 2 days, not 1 but 2!

Like before, for every day we need a 2-dimensional (latitude and longitude) array of temperature values. To represent data for 2 days, we will want to stack the daily arrays together, resulting in a 3-dimensional array:
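A minimal sketch of the stacking step, assuming two daily 2x2 arrays of random values (the names temperature_day1 and temperature_day2 are hypothetical). The new axis goes last so that the shape reads (x, y, day):

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical daily arrays, one 2x2 grid per day
temperature_day1 = np.round(15 + 10 * rng.standard_normal((2, 2)), 2)
temperature_day2 = np.round(15 + 10 * rng.standard_normal((2, 2)), 2)

# Stack along a new last axis so the shape is (x, y, day)
temperature_3d = np.stack([temperature_day1, temperature_day2], axis=-1)
```

Putting the day axis last is a choice made to match a dims order of x, y, day; any other position works as long as dims lists the axes in the same order.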

Now we’ll pass the data to the DataArray constructor, with projected coordinates:

da = xr.DataArray(data=temperature_3d,
                  coords={"lat": (["x","y"], lat),
                          "lon": (["x","y"], lon),
                          "day": ["day1","day2"]},
                  dims=["x","y","day"])
da

We can also create the same thing using a pandas Series with a 3-level multi-index. To create a Series we will need to flatten the data, which means to make it 1-dimensional:

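# make data 1-dimensional
# (column-major order, so all of day1 comes before day2)
temperature_1d = temperature_3d.flatten("F")
# flatten the coordinates in the same order so each value keeps its lat/lon
lat = lat.flatten("F")
lon = lon.flatten("F")
day = ["day1","day2"]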

Now we’ll create a Series with a 3-level multi-index:
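One possible way to build that Series is sketched below. It is self-contained, so the arrays are small made-up stand-ins; np.tile and np.repeat are my choice for lining up the three index levels, and the coordinates are flattened in the same column-major order as the data so that every value keeps its own lat/lon pair:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the arrays used in the article
temperature_3d = np.arange(8, dtype=float).reshape(2, 2, 2)  # dims: x, y, day
lat = np.array([[-60.5, -10.25], [20.0, 45.75]])
lon = np.array([[-120.0, -30.5], [15.25, 90.0]])
day = ["day1", "day2"]

# Column-major flattening puts all of day1 first, then all of day2
temperature_1d = temperature_3d.flatten("F")
lat_1d = lat.flatten("F")
lon_1d = lon.flatten("F")

n_points = lat_1d.size
idx = pd.MultiIndex.from_arrays(
    arrays=[
        np.tile(lat_1d, len(day)),   # lat/lon repeat once per day
        np.tile(lon_1d, len(day)),
        np.repeat(day, n_points),    # each day label covers one block of points
    ],
    names=["lat", "lon", "day"],
)
s = pd.Series(data=temperature_1d, index=idx)
```

Each row of s now carries a (lat, lon, day) label, which is what from_series needs to rebuild the three dimensions.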

And finally, we’ll create a DataArray using the from_series method:

da = xr.DataArray.from_series(s)
da

Dataset

Up until this point, we only dealt with temperature data. Let’s add pressure data:

Now we’ll create a Dataset (not a DataArray) with temperature and pressure as data variables. With projected coordinates, the data_vars argument and the coords argument both expect a dictionary similar to the coords argument for DataArray:
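A sketch of that call under assumed toy values: 2x2 points over 2 days, with a pressure array (named pressure_3d here, a hypothetical name) generated the same way as the temperatures. Each data variable is given as a (dimensions, values) tuple, just like the projected coordinates:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(123)

# Assumed toy values: 2x2 points over 2 days
temperature_3d = np.round(15 + 10 * rng.standard_normal((2, 2, 2)), 2)
pressure_3d = np.round(1000 + 5 * rng.standard_normal((2, 2, 2)), 2)
lat = np.array([[-60.5, -10.25], [20.0, 45.75]])
lon = np.array([[-120.0, -30.5], [15.25, 90.0]])

ds = xr.Dataset(
    data_vars={
        "temperature": (["x", "y", "day"], temperature_3d),
        "pressure": (["x", "y", "day"], pressure_3d),
    },
    coords={
        "lat": (["x", "y"], lat),
        "lon": (["x", "y"], lon),
        "day": ["day1", "day2"],
    },
)
```

Both variables share the same dimensions and coordinates, so a selection such as ds.isel(x=0) applies to temperature and pressure at once.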

We can also create a DataArray for each data variable and then create a Dataset from the DataArrays. Let’s create two DataArrays, for temperature and pressure, using the from_series method, just like we did with the 3-dimensional case:
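A sketch of that step, again with made-up numbers. The names s_temperature and s_pressure are hypothetical, while da_temperature and da_pressure match the names used when building the Dataset:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up coordinates for four points, observed on two days
lat = [-60.5, 20.0, -10.25, 45.75]
lon = [-120.0, 15.25, -30.5, 90.0]
day = ["day1", "day2"]

idx = pd.MultiIndex.from_arrays(
    arrays=[np.tile(lat, 2), np.tile(lon, 2), np.repeat(day, 4)],
    names=["lat", "lon", "day"],
)

rng = np.random.default_rng(123)
s_temperature = pd.Series(np.round(15 + 10 * rng.standard_normal(8), 2), index=idx)
s_pressure = pd.Series(np.round(1000 + 5 * rng.standard_normal(8), 2), index=idx)

# one DataArray per data variable, sharing the same lat/lon/day labels
da_temperature = xr.DataArray.from_series(s_temperature)
da_pressure = xr.DataArray.from_series(s_pressure)
```

Because both Series use the same index, the two DataArrays end up with identical dimensions and coordinates, which makes them straightforward to combine.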

Now we’ll create a Dataset using these two DataArrays:

ds = xr.Dataset(data_vars={"temperature": da_temperature, "pressure": da_pressure})
ds

Conclusion

This was a quick introduction to xarray data structures. There are many more capabilities to xarray, like indexing, selecting and analyzing data. I won’t cover those here, because I want to keep this tutorial simple. I encourage you to take a look at xarray docs and try to play with it a little. Hope you found this article useful!
