Environmental Data Science: An Introduction

Examples, challenges, and perspectives for working with environmental data

Photo by Francesco Gallarotti on Unsplash

Human life is deeply intertwined with the environment. In the current geological epoch, the Anthropocene, we shape the environment through the release of greenhouse gases and chemical products, sprawling infrastructure, and agriculture.

For the data scientist, a natural way to interact with a topic is to look at the available data and its potential. The field of environmental data science is relatively new, but growing in popularity.

The manifestations of climate change, the loss of biodiversity, and pollution that now reaches even the deep sea have heightened our sensitivity to the environment. Today, sustainability is a major focus of political and non-governmental activity, and the question of how we can reconcile our livelihoods with the preservation of the environment must be urgently addressed.

The Climate Change AI initiative is collaborating with major machine learning conferences, an open-access journal, Environmental Data Science, has been launched, and numerous graduate programs at the intersection of environmental studies and data science are being established, such as at Imperial College London.

To my knowledge, there is no clear definition of environmental data science. In this blog post, I will share my perspective on the field, based on my experience as an AI consultant working in the domain. First, I will illustrate the diversity of environmental data science with three examples:

  1. Biosphere monitoring (classification)
  2. Air pollution forecasts (time series)
  3. Flood damage drivers (feature importance)

I will then discuss the challenges associated with environmental data, related to data scarcity, quality, and complexity. Environmental data is different from data encountered in other areas of machine learning, and I will provide my perspective on how these challenges can be addressed.

Finally, I will outline the perspectives I see if we can harness environmental data and combine the power of data science and machine learning with the growing demand for sustainable solutions.


Monitoring Wildlife with Image Classification

Species identified in a photograph by a machine learning algorithm. Original image: By GIRAUD Patrick - Own work, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1093844. Annotations: Author.

As human activities expand into remote regions, many animal and plant species are threatened with extinction. Wildlife conservation efforts rely on accurate monitoring of the various species of interest. In Tuia et al., Nature Communications 2022, the authors list non-invasive data collection devices such as "camera traps, consumer cameras, acoustic traps [… and] on-animal devices".

The data collected can be efficiently analyzed using machine learning algorithms. For example, a camera trap near a waterhole captures images whenever an animal passes by. These images can be classified using computer vision algorithms such as convolutional neural networks. You may have encountered the cat-vs-dog classification tutorials, which can be adapted to classify any other animal species given enough labeled data.
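To make this more concrete, here is a minimal sketch of such a classifier in PyTorch. It is an illustration under my own assumptions, not the setup used in any of the cited work: the class name `TinySpeciesCNN`, the number of species, and the image size are all hypothetical, and random tensors stand in for real camera trap images.

```python
import torch
from torch import nn

NUM_SPECIES = 5  # hypothetical number of species classes


class TinySpeciesCNN(nn.Module):
    """A deliberately small CNN for illustration, not a production model."""

    def __init__(self, num_classes: int = NUM_SPECIES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # RGB input -> 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # halve the spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # global average pooling -> (N, 32, 1, 1)
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)      # (N, 32)
        return self.classifier(x)    # raw logits, one score per species


model = TinySpeciesCNN()
images = torch.rand(4, 3, 64, 64)    # random stand-in for 4 RGB camera trap images
logits = model(images)
predicted = logits.argmax(dim=1)     # index of the highest-scoring species per image
print(logits.shape, predicted.shape)
```

In practice, one would more likely fine-tune a pretrained network on the labeled images than train a small model like this from scratch, especially when labeled data is limited.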


Forecasting Air Quality Time Series

Industrial activity, transportation, and individual traffic affect the release of aerosols into the atmosphere. Aerosols are light particles that can potentially damage the lungs and plants. Their dispersion depends not only on the initial concentration of the release, but also on the weather. On hot days without a lot of wind, aerosols will stay near the ground for much longer than on days with a light breeze.

Many cities have implemented aerosol monitoring systems, where stations at fixed locations measure aerosol concentrations around the clock. The result is a time series, as shown in this example from the city of Bristol, UK:

NOx concentration at three stations in Bristol in 2022. Data: https://data.opendatasoft.com/explore/dataset/air-quality-data-continuous%40bristol, Bristol City Council, Open Government License. Plot: Author.

Initially, this data will be used for monitoring purposes, such as checking that certain air quality targets are being met. But we can also use the data for analysis, to identify the driving factors of bad aerosol conditions. Similar to predicting stock prices, we can use machine learning to generate air quality forecasts that help to mitigate the effects of aerosols.
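As a rough illustration of what such a forecast can look like, the sketch below fits a simple autoregressive model with NumPy and compares it to a persistence baseline. Everything here is an assumption for demonstration purposes: the series is synthetic (a daily cycle plus noise), not the Bristol data, and the 24-hour lag window and the helper `make_lagged` are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic hourly concentration series: daily cycle plus noise (illustrative only).
hours = np.arange(24 * 60)
series = 40 + 15 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 3, hours.size)

N_LAGS = 24  # use the previous 24 hours to predict the next hour


def make_lagged(y, n_lags):
    """Build a matrix whose rows hold the n_lags values preceding each target."""
    X = np.column_stack([y[i : len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]


X, target = make_lagged(series, N_LAGS)
split = int(0.8 * len(target))  # chronological split, no shuffling

# Ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(split), X[:split]])
coef, *_ = np.linalg.lstsq(A_train, target[:split], rcond=None)

A_test = np.column_stack([np.ones(len(target) - split), X[split:]])
forecast = A_test @ coef

# Persistence baseline: the next hour equals the most recent observation.
persistence = X[split:, -1]

mae_model = np.mean(np.abs(forecast - target[split:]))
mae_persistence = np.mean(np.abs(persistence - target[split:]))
print(f"AR model MAE: {mae_model:.2f}, persistence MAE: {mae_persistence:.2f}")
```

Two details matter more than the model itself: the train/test split is chronological, so the model never sees the future, and the forecast is judged against a trivial baseline. A real air quality forecast would also include weather covariates such as wind speed and temperature.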


Identifying Flood Damage Drivers with Feature Importance

Photo by Jonathan Ford on Unsplash

Floods are the most costly natural disasters. The major flood event in Pakistan in 2022 resulted in the loss of 1,700 human lives, the displacement of more than 8 million people, and an estimated US$15 billion in damages [[source](https://blogs.worldbank.org/climatechange/flood-risk-already-affects-181-billion-people-climate-change-and-unplanned)]. An estimated 1.81 billion people live in regions that could be affected by flooding [source].

As the intensity and frequency of flood events increase due to a warming climate and land use changes, there is an urgent need to understand the drivers of flood damage. The HOWAS database collects qualitative and quantitative data on flood events, such as the proximity of buildings to water, warning lead times, and building characteristics. It is currently expanding beyond its original focus on German and Austrian data to include global data.

In Kellermann et al., The object-specific flood damage database HOWAS 21, Natural Hazards and Earth System Sciences (2020), a random forest regression algorithm was used to predict total building damage. The various input features were ranked according to their importance to the algorithm. High-quality datasets like these can help to improve flood warnings and mitigate the damage.
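The general idea of ranking features with a random forest can be sketched with scikit-learn as follows. This is not the analysis from the paper: the data is synthetic, the feature names are hypothetical, and I use scikit-learn's built-in impurity-based importances, which may differ from the importance measure the authors used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500

# Hypothetical, synthetic features; they are NOT taken from HOWAS 21.
feature_names = [
    "water_depth_m",
    "distance_to_river_m",
    "warning_lead_time_h",
    "random_noise",
]
X = np.column_stack([
    rng.uniform(0, 3, n),     # water depth at the building
    rng.uniform(0, 500, n),   # distance to the nearest river
    rng.uniform(0, 48, n),    # warning lead time
    rng.normal(0, 1, n),      # pure noise, should rank low
])

# Synthetic damage: driven mainly by water depth, slightly reduced by lead time.
damage = 20_000 * X[:, 0] - 100 * X[:, 2] + rng.normal(0, 2_000, n)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, damage)

# Impurity-based importances; they sum to 1 across all features.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:>22s}: {importance:.3f}")
```

Impurity-based importances are convenient but can be biased toward features with many distinct values, so permutation importance on held-out data (`sklearn.inspection.permutation_importance`) is a common cross-check.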

Driving factors for flood damage as obtained by feature importance ranking. Credit: Kellermann et al, https://doi.org/10.5194/nhess-20-2503-2020.

Challenges

Based on my experience as a research data scientist working in the field, I see four major challenges in the use of environmental data.

Data Scarcity

Collecting environmental data can be a menial task. To get the data, you actually have to physically go to a location, set up sensors, or collect samples by hand. Compared to customer data in an online store, where data simply flows into your database over the internet, the cost of obtaining a single sample is much higher. The inaccessibility of some regions can make it difficult to collect data with sufficient spatial and temporal resolution.

Often, environmental datasets consist of a few hundred samples that have been obtained at great expense. Data analysis and algorithms must be carefully selected and adapted as needed.

Specialized Data Formats

For historical reasons or for convenience, environmental data often comes in formats that machine learning algorithms and data science pipelines cannot readily ingest. In my experience training environmental data scientists, trainees therefore often struggle to translate the concepts from standard machine learning tutorials to environmental data.

Many machine learning routines rely on data that is shaped like standard warehouse data, is accessible as CSV or JSON files, or comes in standard image formats. Environmental data is often collected by a small number of people, and the collection protocols do not consider downstream data science applications. Hence, data engineering and cleaning are often a major task.

Data Quality

Even when data is available in sufficient quantity, the data quality may fluctuate significantly. For example, photographic observations of the environment depend on daylight and camera angle, and images can look very different from shot to shot. Sensors can fail and degrade in quality, and qualitative databases like HOWAS are expensive to maintain.
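A first line of defense against failing sensors is a simple automated quality check. The sketch below flags missing values, implausible values, and "stuck" sensors that repeat the same reading. The function name `flag_suspect_readings`, the plausible range, and the run-length threshold are hypothetical choices for illustration; sensible values depend on the instrument and the measured quantity.

```python
import numpy as np


def flag_suspect_readings(values, max_flat_run=6, valid_range=(0.0, 1000.0)):
    """Return a boolean mask marking readings that look unreliable.

    Flags missing values, values outside a plausible range, and runs of
    more than `max_flat_run` identical consecutive readings (stuck sensor).
    """
    values = np.asarray(values, dtype=float)
    low, high = valid_range

    suspect = np.isnan(values)
    with np.errstate(invalid="ignore"):  # comparisons with NaN are simply False
        suspect |= (values < low) | (values > high)

    run_start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the array or when the value changes.
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start > max_flat_run:
                suspect[run_start:i] = True
            run_start = i
    return suspect


readings = [12.0, 14.5, float("nan"), 13.0, -5.0, 20.0, 20.0, 20.0, 20.0, 18.2]
mask = flag_suspect_readings(readings, max_flat_run=3)
print(mask)
```

Here the missing value, the negative reading, and the four identical readings of 20.0 are flagged, while the remaining values pass. Flagged readings should be reviewed or masked rather than silently deleted, since gaps themselves carry information about sensor health.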

Domain Knowledge

Domain knowledge is crucial for processing environmental data, understanding the layout, and asking the right questions. Labeling environmental data often requires a substantial amount of expertise. Therefore, interdisciplinary projects are the norm, where data scientists and machine learning specialists work closely with domain experts.

Perspectives

Despite the challenges of working with environmental data, I believe that environmental data science has great potential. Sustainability is a major focus of today’s political and non-governmental discussions, and good data can help make informed decisions. Businesses are developing strategies based on real-time data, and it would be great to see a similar impact of data in sustainability discussions.

Aspiring environmental data scientists may find working in the sustainability sector to be fruitful and rewarding. In projects related to the protection of natural resources, monitoring of the environment, and implementation of sustainability policies, they may find more purpose than in developing the millionth algorithm for online marketing or cryptocurrency prediction.

I hope that in the future environmental data science will become an established area of data science and machine learning. Educating engineers and scientists who can work at the intersection of environmental studies and programming is a first step towards increasing the maturity of the field. As high-quality data becomes more available and interest in sustainability grows, I am confident that environmental data science will be important in the future.

Photo by Marc Schulte on Unsplash

Summary

To summarize this introduction to environmental data science, I’ve provided three examples that illustrate the diversity of the field. There are unique challenges to working with environmental data, mainly related to data scarcity and data quality. Interdisciplinary projects with domain experts and data scientists can take advantage of the unanalyzed environmental data. I hope that in the future, environmental data science will reach a stage of maturity where environmental data directly informs decisions and improves sustainability.

