
Why Data Scientists Should Decluster Their Geospatial Datasets

A Python tutorial on dealing with biased sampling of spatial data, with a realistic example from the environmental industry


Image by author

Decisions regarding a site or region are often made based on statistical analyses of irregularly scattered geospatial data. The variables of interest are typically affected by heterogeneities throughout the site. For example, meteorological data such as temperature could be influenced by proximity to bodies of water, contaminant concentrations could be linked to the direction of groundwater flow, and natural resources such as ore, hydrocarbon, or timber could be tied to the geological media in the subsurface.

If areas with anomalously high or low values of a given variable are sampled more densely than the rest of the site, the sample statistics will likely be biased relative to the true distribution of that variable. When surveying a site, the areas containing these anomalous values are often preferentially sampled because they represent areas of interest that we want to understand better. To avoid the bias that arises from preferential sampling, samples should be collected on a regular grid or at random. The figure below illustrates how regular and random sampling do not skew the statistics of a site while biased sampling does.

Schematic illustrating how biased sampling results in a difference from the true mean for a given site when compared to regular or random sampling. Image by author
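To make this concrete, here is a small, purely illustrative simulation (not the article's dataset; all names and numbers are made up) showing how preferentially sampling a high-valued hot spot inflates the sample mean relative to random sampling.

import numpy as np

rng = np.random.default_rng(42)

# Toy "true" site: mostly background values plus a small hot spot of high values
background = rng.normal(10, 2, 950)
hot_spot = rng.normal(40, 5, 50)
true_values = np.concatenate([background, hot_spot])
print(f"True mean:          {true_values.mean():.2f}")

# Random sampling: 100 samples drawn uniformly from the whole site
random_sample = rng.choice(true_values, size=100, replace=False)
print(f"Random-sample mean: {random_sample.mean():.2f}")

# Biased sampling: half of the samples deliberately target the hot spot
biased_sample = np.concatenate([
    rng.choice(background, size=50, replace=False),  # 50 background samples
    rng.choice(hot_spot, size=50, replace=False),    # 50 hot-spot samples
])
print(f"Biased-sample mean: {biased_sample.mean():.2f}")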

The deviation from the true mean might be due to the infeasibility of regular or random sampling, or, more likely, to targets of interest that lead to preferential sampling. This preferential or biased sampling is seen in nearly every industry that collects spatial data, especially the oil and mining industries. To account for the irregular sampling and the bias caused by heterogeneous site-specific effects, declustering has been established as an essential geostatistical tool for analyzing spatial data.

Declustering biased geospatial data is important for achieving more representative statistics and better approximating the distribution of a spatial variable for any kind of decision making.

Although there are several types of declustering, such as polygonal declustering, in this tutorial we will only use cell declustering. A video version of this article is also available, which explains the concept of cell declustering in greater detail and runs through the same environmental example presented here.

What follows is an introduction to cell declustering; the next section then walks through a simple environmental example in Python.


Cell declustering

Cell declustering is the most common declustering algorithm. It overlays a grid of predefined cells on the data and assigns a weight to each sample based on the number of samples in its cell and the area that the cell represents. The cell declustering concept is illustrated in the figure below.

Schematic of the cell declustering concept where equal-sized cells assign larger weights to smaller sample counts and smaller weights to larger sample counts within each cell. Image by author
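To make the mechanics concrete, below is a minimal sketch of the weighting step for a single, fixed grid. The cell_weights helper is hypothetical, written just for this illustration (it is not part of geostatspy), and it rescales the weights so they average to one, a common convention.

import numpy as np

def cell_weights(x, y, cell_size, x0=0.0, y0=0.0):
    """Single-grid cell declustering: weight each sample by 1 / (number of
    samples sharing its cell), then rescale so the weights average to 1."""
    # Assign each sample to a cell of the grid anchored at (x0, y0)
    ix = np.floor((np.asarray(x) - x0) / cell_size).astype(int)
    iy = np.floor((np.asarray(y) - y0) / cell_size).astype(int)
    cells = np.stack([ix, iy], axis=1)
    # Count how many samples fall in each occupied cell
    _, inverse, counts = np.unique(cells, axis=0,
                                   return_inverse=True, return_counts=True)
    w = 1.0 / counts[inverse.ravel()]
    return w * len(w) / w.sum()  # rescale to a mean weight of 1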

Throughout the tutorial we will often be referring to either the naive or declustered data which are defined as:

  • Naive refers to the data in its raw format without any declustering
  • Declustered refers to the data which has incorporated the declustering weights for each sample

Note that in this simple tutorial we abstract away much of the math behind declustering; refer to [1] if you want to learn more about the theory behind the concepts introduced here.

Cell size

The most important input required to run cell declustering is the cell size. Typically, the cell size should be equal to the coarsest spacing of samples within the dataset, but this can sometimes be difficult to judge.

Varying the cell size has a big impact on the declustering weight each sample receives. The figure below illustrates three cell-size scenarios, ranging from the smallest possible to an adequate size and finally to the largest possible. Using either the smallest cell size, where each cell contains one sample, or the largest cell size, where one big cell contains all the samples, results in the same outcome: every sample receives an equal weight, which is equivalent to no declustering at all.

Schematic illustrating the declustering weight for each sample for the smallest, adequate, and largest cell size. Image by author

The best approach for choosing an appropriate cell size is to compute the declustered mean for a range of cell sizes and pick a value near where the declustered mean is minimized or maximized. The figure below illustrates how the declustered mean varies with increasing cell size. Whether the declustered mean increases or decreases from the naive mean depends on whether the clustered, biased samples are greater than or less than the naive mean: if high values were oversampled, declustering pulls the mean down and the minimizing cell size is appropriate; if low values were oversampled, the maximizing cell size is appropriate. A simple scan over candidate cell sizes, using toy data, is sketched below.

Schematic showing the typical relationship between increasing cell size and declustered mean. Image by author
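As a rough sketch of that scan, reusing the hypothetical cell_weights helper above on some made-up clustered data, one could compute the declustered mean for a range of cell sizes and keep the minimizing (or maximizing) one:

# Toy clustered data: sparse background samples plus a tight, high-valued cluster
rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0, 1000, 60), rng.uniform(400, 450, 40)])
y = np.concatenate([rng.uniform(0, 1000, 60), rng.uniform(400, 450, 40)])
values = np.concatenate([rng.normal(5, 1, 60), rng.normal(30, 3, 40)])

cell_sizes = np.linspace(10, 500, 50)
declustered_means = [np.average(values, weights=cell_weights(x, y, s))
                     for s in cell_sizes]

# The high-valued cluster was oversampled, so we look for the minimizing size
best_size = cell_sizes[np.argmin(declustered_means)]
print(f"Naive mean: {values.mean():.2f}, "
      f"minimum declustered mean: {min(declustered_means):.2f} "
      f"at cell size {best_size:.0f}")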

Number of cell offsets

Another important parameter for cell declustering is the number of cell offsets. Since the origin of the declustering grid impacts the weight each sample receives, multiple random origins are used and the weights are averaged across them. The figure below shows an example using three and nine random origins. In practice, anywhere from 25 to 100 random origins are typically used, depending on the complexity of the dataset.

Schematic illustrating 3 random origins for the left plot and 9 random origins for the right plot. Image by author

After defining the cell size and number of cell offsets, cell declustering can be run to assign each sample a declustering weight.
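Putting the cell size and the offsets together, a bare-bones version of the procedure might look like the sketch below, which again reuses the hypothetical cell_weights helper and the toy x, y, values, and best_size from the scan above; the geostatspy function used in the next section implements its own, more complete version of this procedure.

def declustered_weights(x, y, cell_size, noff=25, seed=0):
    """Average single-grid cell weights over `noff` random grid origins."""
    rng = np.random.default_rng(seed)
    w = np.zeros(len(np.asarray(x)))
    for _ in range(noff):
        # Shift the grid origin by a random offset within one cell
        x0, y0 = rng.uniform(0, cell_size, size=2)
        w += cell_weights(x, y, cell_size, x0, y0)
    return w / noff

# Declustered mean for the toy data above at the chosen cell size
w = declustered_weights(x, y, best_size)
print(f"Declustered mean with offsets: {np.average(values, weights=w):.2f}")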


Declustering a synthetic environmental dataset

The dataset we are going to use is a synthetic arsenic contamination dataset consisting of 153 groundwater samples scattered in 2D space, with arsenic concentrations measured in μg/L. The World Health Organization (WHO) guidelines state that arsenic concentrations in drinking water should not exceed 10 μg/L, so we will analyze the dataset to check whether the average arsenic concentration is below this limit and the water is safe to drink.

The script and dataset used in this example are available in this GitHub repository.

Required libraries and data preprocessing

We are eventually going to use the geostatspy library, which can be installed by entering the line below in the terminal.

pip install geostatspy

In addition to geostatspy, we will also make use of three standard Python libraries, which we can import as shown below.

import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import geostatspy.geostats as geostat

Now we can load our data into a pandas DataFrame by running the code below.

dataframe = pd.read_csv('Synthetic_Arsenic_Data_FF.csv')
dataframe

The above code also gives us an initial look at the dataset and how it is formatted, as shown in the output below.

The synthetic environmental dataset in use consisting of easting and northing coordinates with measured arsenic concentrations. Image by author

We can visualize the dataset using NumPy and Matplotlib; the figure below shows a plot of all the samples and their arsenic concentrations. The clusters of high arsenic concentration in the northwest and southeast are clear signs of preferential sampling, which was probably done in an attempt to delineate these anomalous areas of interest.
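The plotting code is roughly as follows; the column names match the DataFrame shown above, while the figure size and colormap are just illustrative choices.

plt.figure(figsize=(7, 6))
sc = plt.scatter(dataframe['East'], dataframe['North'],
                 c=dataframe['Arsenic'], cmap='viridis', edgecolor='k')
plt.colorbar(sc, label='Arsenic (μg/L)')
plt.xlabel('Easting (m)')
plt.ylabel('Northing (m)')
plt.title('Arsenic concentration at each sample location')
plt.show()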

Plot of the arsenic contaminant samples showing clusters of high arsenic values in the northwest and southeast. Image by author

Taking the mean of these samples gives an arsenic concentration of 12.60 μg/L, which is above the WHO guideline for drinking water, but this may be an artifact of the biased sampling. Next, we will decluster the dataset to see if we can achieve a more representative mean.
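The naive mean is simply the unweighted average of the arsenic column:

naive_mean = dataframe['Arsenic'].mean()
print(f'Naive mean arsenic concentration: {naive_mean:.2f} μg/L')  # 12.60 μg/L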

Dataset declustering

We can run the declustering using the geostatspy package we installed earlier, which requires a few parameters defining the cell declustering inputs:

  • ncell is the total number of cell sizes to test, linearly interpolated between the smallest cell size, cmin, and the largest cell size, cmax, both given in the same units as the coordinates in the dataset
  • noff is the number of random offsets we want to use
  • iminmax is a binary option: setting it to 0 chooses the cell size that maximizes the declustered mean, while setting it to 1 chooses the cell size that minimizes the declustered mean

In this example, we will test 200 cell sizes interpolated from 1 m to 750 m, with 25 random cell offsets. Also, since our biased clusters are higher than the naive arsenic mean, we will set iminmax equal to 1 so that the weights we obtain use the cell size that minimizes the declustered mean. The declustering code can be seen below.

W, Csize, Dmean = geostat.declus(dataframe, 'East', 'North', 'Arsenic',
                                 iminmax=1, noff=25, ncell=200, cmin=1, cmax=750)
# W is the array of declustering weights, one per sample
# Csize is the array of tested cell sizes
# Dmean is the declustered mean for each tested cell size

The function returns three outputs:

  • Weights (W): the declustering weights for each sample, in the same order as our data
  • Cell sizes (Csize): the tested cell sizes, in the same units as the coordinates of our dataset
  • Declustered means (Dmean): the mean of all samples with the declustering weights incorporated, one for each tested cell size, in the same units as our variable of interest

Let’s take a look at the declustered means for the tested cell sizes by plotting Csize against Dmean from the above code.
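A minimal version of that plot, with the naive mean and the minimizing cell size overlain for reference, might look like this (styling choices are illustrative):

plt.figure(figsize=(7, 4))
plt.plot(Csize, Dmean, '-', label='Declustered mean')
plt.axhline(dataframe['Arsenic'].mean(), color='r', linestyle='--', label='Naive mean')
plt.axvline(Csize[np.argmin(Dmean)], color='g', linestyle=':', label='Minimizing cell size')
plt.xlabel('Cell size (m)')
plt.ylabel('Declustered mean (μg/L)')
plt.legend()
plt.show()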

Declustered means for all the tested cell sizes used in the example; the naive mean and minimizing cell size have been overlain on the plot. Image by author

The weights that are saved are those corresponding to the cell size that achieved the minimum declustered mean. Below is a plot of the samples showing the declustering weights, allowing us to visually check whether the clustered samples have smaller weights than the far-off samples that represent larger areas. As expected, and as shown in the figure below, the clustered samples have smaller weights while samples with few neighbors have higher weights.
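This plot can be produced the same way as the earlier location map, simply coloring the points by W instead of the arsenic values (the colormap choice is illustrative):

plt.figure(figsize=(7, 6))
sc = plt.scatter(dataframe['East'], dataframe['North'],
                 c=W, cmap='plasma', edgecolor='k')
plt.colorbar(sc, label='Declustering weight')
plt.xlabel('Easting (m)')
plt.ylabel('Northing (m)')
plt.title('Declustering weight at each sample location')
plt.show()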

Plot of the declustering weights for each sample obtained after running the cell declustering. Image by author

Comparing the naive and declustered statistics

When performing any kind of statistical analysis, incorporating the declustering weights will yield more representative statistics than the naive data. Since locations with high arsenic concentrations were oversampled, we expect the declustered statistics to reduce the influence of the high-arsenic samples in our data. The histograms below show how the declustered arsenic accounts for this bias: the weighted counts at lower arsenic concentrations are higher than for the naive arsenic, correcting the bias and yielding more representative statistics.
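These histograms can be reproduced by passing the declustering weights to Matplotlib's weights argument; assuming the weights average to roughly one, the two frequency axes remain directly comparable (the bin count and figure size are illustrative):

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
axes[0].hist(dataframe['Arsenic'], bins=20)
axes[0].set_title('Naive')
# The weights act as per-sample frequencies in the histogram
axes[1].hist(dataframe['Arsenic'], bins=20, weights=W)
axes[1].set_title('Declustered')
for ax in axes:
    ax.set_xlabel('Arsenic (μg/L)')
    ax.set_ylabel('Frequency')
plt.show()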

Histograms of arsenic concentration for the naive data on the left and the declustered data on the right. Image by author

It is important to note that the actual values of the samples do not change; this can be observed by comparing the minimum and maximum arsenic concentrations of the naive and declustered data, which are identical. The weights only control the frequency of each sample. For example, a sample with a value of 5 and a weight of 2, when declustered, behaves like two samples with a value of 5, not one sample with a value of 10. This is why we see differences in statistical parameters such as the quantiles and the mean but not in the minimum and maximum, as shown in the table below.

Comparison of the descriptive statistics for the naive and declustered arsenic concentrations. Image by author

As shown in the table above, the mean arsenic concentration for the naive data is 12.60 μg/L, which exceeds the WHO guideline. The mean arsenic concentration for the declustered data is only 8.82 μg/L, which is within the WHO guideline for safe drinking water. By incorporating declustering to account for the biased sampling in the dataset, money that would have been spent remediating this site can now go elsewhere, as there is no site-wide threat from arsenic contamination.
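As a quick sanity check, the declustered mean reported in the table can be reproduced as a weighted average of the arsenic values; np.average normalizes the weights internally, so their absolute scale does not matter.

declustered_mean = np.average(dataframe['Arsenic'], weights=W)
print(f'Naive mean:       {dataframe["Arsenic"].mean():.2f} μg/L')  # 12.60 μg/L
print(f'Declustered mean: {declustered_mean:.2f} μg/L')             # ≈ 8.82 μg/L per the table above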


Summary

Declustering is a crucial step when conducting any form of spatial data analysis on data that have been preferentially sampled. Scientists and engineers who are unfamiliar with analyzing geospatial data may be drawing misleading conclusions by working with the naive data and not accounting for biased sampling.

The cell declustering parameters used will vary significantly depending on the dataset. While there isn’t a perfectly scientific approach for selecting the optimal parameters for cell declustering, an easily applicable workflow was presented for selecting adequate parameters based on what is typically done in practice.

It is also worth noting that declustering should not be pursued when samples are taken at regular or random intervals, as there is no bias to correct. While regular or random sampling would make spatial data analysis easier, there are several reasons why geospatial data may be preferentially sampled, such as: (1) a desire to better understand anomalous areas of interest, (2) a lack of resources to conduct sufficient sampling, or (3) inaccessible locations within the survey grid, to name a few.

Although the example presented here was in 2D, the approach is the same for a 3D dataset, though it may require more sophisticated libraries than the ones used here. A 3D spatial dataset would consist of easting and northing coordinates, with samples also varying in elevation. If you are curious, below is an article that visualizes a 3D geospatial dataset from the mining industry which would require declustering.

Easy techniques for visualizing 3D subsurface borehole data

References

[1] A. G. Journel, Nonparametric estimation of spatial distributions (1983), Journal of the International Association for Mathematical Geology 15: 445–468.

