Towards Environmental Digital Twins in Azure with Dask and Pangeo

Remko de Lange
Towards Data Science
Apr 16, 2021 · 5 min read


Photo by J.A.Neshan Paul on Unsplash

Digital Twins of the environment can help reach sustainability goals and tackle issues related to climate change. They will rely heavily on geospatial data, and on the processing and analytics thereof. Cloud environments provide the flexibility and scalability needed to cope with the potentially enormous geospatial datasets. Here, I explore the Azure cloud capabilities and place them in a broader multi-cloud perspective.

Environmental Digital Twin

Many definitions of Digital Twins exist, but a common denominator is a digital representation of a physical object or process. Commonly, these objects are under the control of the user of the Digital Twin. With Environmental Digital Twins, I regard outdoor environments as the central ‘objects’, which have no single owner, or, in the case of landowners for example, no one who holds full control over the object. At least three groups can benefit from Environmental Digital Twins:

- organizations that impact the environment

- organizations that plan for land usage and infrastructure

- organizations that rely on the environment for their operations

Inter Cloud Interactions

In another blog post, I discussed the dispersed data landscape we currently face when it comes to large geospatial datasets, such as satellite data or modelled atmospheric data. That situation will remain for many years, as I have argued that no single organization can or is willing to host all these datasets. Because of the size of the data, transferring it from one environment to another is not always feasible, and therefore I regard Inter Cloud Interactions as pivotal for successful Environmental Digital Twins. To minimize the data volume transferred between cloud or on-premise environments, it should be possible to process the data prior to transfer. Each cloud environment should provide processing capabilities for data selection, aggregation, and scoring, and for processing with one's own code. The remainder of this blog post focuses on mapping technologies within the Microsoft Azure cloud environment that can facilitate these Inter Cloud Interactions (green rectangle shown in figure 1).

Figure 1. Digital Twin set-up in a multi-cloud environment, with the focus here on the Azure capabilities supporting the inter cloud interactions (green rectangle). A detailed explanation of the diagram can be found here (image by author)
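To make the "process before transfer" idea concrete, below is a minimal sketch of selecting and aggregating a large gridded dataset close to where it is stored, so that only a small result needs to cross cloud boundaries. The Zarr store path and the variable name are hypothetical, and reading from blob storage this way assumes xarray, Dask, and a suitable filesystem package (such as adlfs) with valid credentials.

```python
# Minimal sketch of "process before transfer": reduce the data in the cloud
# where it lives, then move only the small result to another environment.
# The store path and the "t2m" variable are hypothetical examples.
import xarray as xr

ds = xr.open_zarr("abfs://climate/temperature.zarr")                   # lazy, Dask-backed
subset = ds["t2m"].sel(latitude=slice(54, 50), longitude=slice(3, 8))  # spatial selection
monthly_mean = subset.resample(time="1MS").mean()                      # temporal aggregation
monthly_mean.to_netcdf("monthly_mean.nc")                              # small file to transfer
```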

Azure capabilities

Microsoft's environmental sustainability targets are ambitious and can only be met in collaboration with industry, governments, NGOs and researchers. On some fronts, the workings of natural processes, as well as the interactions between humans and System Earth, are not well understood. Only armed with that knowledge can one face climate change effectively with actionable insights. To help scientists and researchers deepen these insights, Microsoft is building the Planetary Computer. The Planetary Computer will therefore help to understand where and how we can best mitigate the human impact on the environment, as well as support decision makers on climate adaptation measures. The Planetary Computer consists of four major components that can also help to build Environmental Digital Twins:

- Planetary Computer Data Catalog — to search and find open datasets

- Planetary Computer APIs — to access and retrieve data

- Planetary Computer Hub — for data processing, paired with a unified development environment

- Planetary Computer Applications — third-party open and closed applications built on the Planetary Computer infrastructure

Organizations can build their Environmental Digital Twins partly upon the Planetary Computer, which fully resides in Azure, alongside all other Azure services. Here, the technologies of the Planetary Computer are mapped onto the Environmental Digital Twin components that facilitate the Inter Cloud Interactions as seen from the Azure cloud, see figure 2.

Figure 2. Technology mapping for intra cloud interactions in Azure for Environmental Digital Twins supporting the inter cloud interactions (image by author)

Open data and data storage

Microsoft has onboarded petabytes of data and made them available through the Planetary Computer; the currently available datasets can be found in the Planetary Computer Data Catalog. The datasets include, among many others, Landsat 8, Sentinel-2, a separate harmonized Landsat Sentinel-2 dataset, and ASTER. All data is stored on Azure Blob Storage and can be accessed directly through the blob storage APIs. As finding datasets on blob storage can be a challenge, they are indexed and made searchable through the open SpatioTemporal Asset Catalog (STAC) specification.
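As an illustration, a minimal sketch of searching that STAC catalog from Python is shown below; it assumes the pystac-client and planetary-computer packages are installed, and the bounding box and date range are made up for the example.

```python
# Minimal sketch: search the Planetary Computer STAC API for Sentinel-2 scenes
# and sign the results so the underlying blob storage assets can be read.
import planetary_computer
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[4.0, 51.8, 5.2, 52.5],        # illustrative area of interest
    datetime="2021-03-01/2021-04-01",
)
items = [planetary_computer.sign(item) for item in search.items()]
print(f"Found {len(items)} Sentinel-2 items")
```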

Besides the data provided by the Planetary Computer, the Digital Twin owner will also need to bring in their own data, which, for large datasets, is likely to land on blob storage. Geospatial vector data, however, can land in Azure managed databases (SQL, Postgres, Cosmos DB) or also on blob storage, and be made available through STAC or OGC API Features.
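As a sketch of bringing in your own data, the snippet below uploads a local file to Azure Blob Storage with the azure-storage-blob SDK; the storage account, container, and file names are hypothetical, and authentication is assumed to go through DefaultAzureCredential.

```python
# Minimal sketch: land your own dataset on Azure Blob Storage.
# Account URL, container, and file names are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://mytwinstorage.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("twin-input")
with open("field_observations.gpkg", "rb") as data:
    container.upload_blob("vector/field_observations.gpkg", data, overwrite=True)
```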

Compute engines

The Planetary Computer Hub provides researchers with Dask-based clusters. Dask makes it possible to parallelize Python code efficiently on a single machine or across a cluster. These Planetary Computer clusters can be started and accessed through a web client, and they come pre-installed with a Pangeo environment. This enables researchers to work directly with the geospatial datasets within a stable environment full of cloud, geospatial, and ML Python libraries. Researchers interact with the Dask cluster through JupyterHub or their local Python environment. The Planetary Computer Hub enables researchers to start quickly on cloud infrastructure without prior knowledge of it. When larger clusters are needed than the Hub provides out of the box, or when Digital Twin owners need their own infrastructure for their applications, Dask clusters can easily be created in an Azure subscription. With the Dask Cloud Provider, clusters can be created in Azure as well as in other clouds. This also makes it possible, in one go, to use the Pangeo Docker image, so that the cluster comes pre-installed with the Pangeo environment, just like the Planetary Computer Hub.
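A minimal sketch of the latter is shown below: it creates a Dask cluster in an Azure subscription with Dask Cloud Provider, using the Pangeo Docker image so that the workers carry the Pangeo environment. The resource group, virtual network, and security group names are assumptions about an existing Azure setup.

```python
# Minimal sketch: a Dask cluster in your own Azure subscription, pre-installed
# with the Pangeo environment. Azure resource names below are placeholders.
from dask.distributed import Client
from dask_cloudprovider.azure import AzureVMCluster

cluster = AzureVMCluster(
    resource_group="digital-twin-rg",
    vnet="digital-twin-vnet",
    security_group="digital-twin-nsg",
    docker_image="pangeo/pangeo-notebook:latest",  # Pangeo environment on every worker
    n_workers=4,
)
client = Client(cluster)   # connect and start submitting work
```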

Unified development environment

The Planetary Computer is based on open-source tools, works with open data, and supports open standards, making it accessible, transparent, and flexible for researchers and developers. No new tools or languages need to be learned if the developer is already acquainted with Python, and existing Python code can be brought to the Dask environment. A Dask cluster can be accessed through a JupyterHub web client or from a local Python environment. As this approach works for multiple public cloud providers with the Dask Cloud Provider, developers have a unified development environment at hand.
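To illustrate the portability, the short sketch below connects to whichever Dask cluster is available and runs unchanged xarray code against it; the Zarr store name is a hypothetical placeholder.

```python
# Minimal sketch of the unified workflow: the same code runs against a local
# cluster, the Planetary Computer Hub, or a cluster created with Dask Cloud
# Provider; only the Client it connects to changes.
import xarray as xr
from dask.distributed import Client

client = Client()                                   # local cluster; swap for a cloud cluster
ds = xr.open_zarr("ndvi_timeseries.zarr")           # hypothetical store
mean_ndvi = ds["ndvi"].mean(dim="time").compute()   # runs on the attached cluster
print(mean_ndvi)
```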

Last notes

There are cases where the technology presented above might not be optimal, especially when more than Python code needs to be applied, or when a particular job scheduler is required. Within a Digital Twin owner's Azure subscription, many other options are available to facilitate data processing and automation, such as Azure Batch, Azure Databricks, Azure Synapse Analytics Spark Pools, and Azure Kubernetes Service.

Reference to overarching blog post: Designing for Digital Twins

Opinions here are mine.
