The world’s leading publication for data science, AI, and ML professionals.

Understanding Point Process Model with R

I have recently been playing with spatial data as a learning exercise and I have been particularly interested in point pattern analysis…

Hands-on Tutorials

A hands-on introduction to point process models with R

I have recently been playing with spatial data as a learning exercise and I have been particularly interested in point pattern analysis, often use in epidemiology and developing in ecology. While some packages already have canned function for this (see the excellent inlabru or the very well known spatstat), I prefer not to rely on them. As I wanted to improve my understanding of point pattern models and decided to use rstanwhere I need to code my model from scratch. Stan, this is a probabilistic programming language to obtain Bayesian inference using the Hamiltonian Monte Carlo algorithm.

Point pattern and point process

There are many resources introducing to the notions of point patterns and point processes but I will quickly explain these two notions here.

Point pattern

A point pattern represents the distribution of a set of points in time, space or higher dimensions. For instance, the location of trees in a forest can be thought as a point pattern. The location of crimes is another example of point pattern. There are three general patterns:

  • Random : any point is equally likely to occur at any location and the position of a point is not affected by the position of other points. For example, if I throw a bag of marble on the floor it is likely that the pattern will be random.
  • Uniform : every point is as far from its neighbors as possible. For example, we can think of a human-made forests where trees are regularly placed.
  • Clustered : many points are concentrated close together, possibly due to a covariate. We can take the example of bees locations in a field, locations will likely cluster around flowers. The point pattern that we simulate in this post represent a clustered point pattern.

Point process

A Spatial point processes is a description of the point pattern. We can think of it as the model which generated the point pattern. The points arise from a random process, described by the local intensity λ(s), which measures the expected density of points at a given location, s, in space. If points arise independantly and at random, the local intensity can be described by a homogenous Poisson distribution and is refered to as a Poisson point process. If event locations are independant but the intensity varies spatially, the distribution arises from an inhomogenous point process (i.e. λ(s) varies). The latter is also called inhomogenous Poisson process.

We can model the intensity of the inhomogenous point process as a function of covariates. We describe this type of model as follow:

λ(s)=exp(α+β∗X(u))

Where X(u) is a spatial covariate and α and β are parameters to be estimated. Z(u) can represent the pH of a soil for instance or temperature in the air.

R libraries

To replicate this tutorial you will need to load the following libraries:

library(spatstat)
library(sf)
library(sp)
library(maptools)
library(raster)
library(rstan)
library(tidyverse)
library(cowplot)

Simulating a point pattern

First, I need to simulate a point pattern. In my opinion there is countless benefit to simulating data and it particularly helps to understand the generation process (i.e. how is the data created). From a pragmatic perspective, when generating data we have total control over the parameters and it is very easy to see if we made a mistake when fitting a model.

Here is a function which internally generate a point pattern based on our values of α, β and the dimensions of our study area. Note that the dim[1] and dim[2] have to be equal.

The function returns a list of 3 objects:

  • The number of points in each grid cell. This will be useful when fitting the model in stan.
  • A ppp object which is a spatstat object. This will be helpful when fitting the model with spatstat
  • The covariate, which is a grid of values
genDat_pp <- function(b1, b2, dim, plotdat = TRUE){

  # Define the window of interest
  win <- owin(c(0,dim[1]), c(0,dim[2]))

  # set number of pixels to simulate an environmental covariate
  spatstat.options(npixel=c(dim[1],dim[2]))

  y0 <- seq(win$yrange[1], win$yrange[2],
            length=spatstat.options()$npixel[2])
  x0 <- seq(win$xrange[1], win$xrange[2],
            length=spatstat.options()$npixel[1])
  multiplier <- 1/dim[2]

  # Simulate the environmental covariate
  gridcov <- outer(x0,y0, function (x,y) multiplier*y + 0*x)
# Set the coefficients
  beta0 <- b1
  beta1 <- b2

  # Simulate the point pattern
  pp <- rpoispp(im(exp(beta0 + beta1*gridcov), xcol=x0, yrow=y0))

  # We count the number of points in each grid cell, plot it and make a vector out of it
  qcounts <- quadratcount(pp, ny=dim[1], nx=dim[2])
  dens <- density(pp)
  Lambda <- as.vector(t(qcounts)) # We have to use t() as we need to # construct the vector with the column first

  if(plotdat == TRUE){
    par(mfrow=c(1,2), mar=c(2,2,1,1), mgp=c(1,0.5,0))
    plot(im(gridcov), main = 'Covariate')
    plot(dens, main = 'Intensity')
  }
  # Return a list with which I can play with
  return(list(Lambda = Lambda, pp = pp, gridcov = gridcov))
}

I set a seed for the results to be replicated and choose the parameters for the simulation:

set.seed(123)
b1 <- 2
b2 <- 3
dim <- c(20,20)

And finally generate my point pattern. The function generated two plots, the first one is the simulated covariate and the second one is the simulated intensity of the point pattern.

pp <- genDat_pp(b1, b2, dim)

Et voila! We now have data we can play with!

Fitting point process model with spatstat

As a basic check I fit the model with the function ppm() from the package spatstatto be sure I am able to recover the parameters I have previously specified. spatstatis certainly the most successful package when it comes to Spatial Analysis and have plenty of good functions for fitting point process models but also to describe the point pattern. Check out the website!

cov <- im(pp$gridcov)
fit <- ppm(pp$pp ~ 1 + cov)
fit$coef
## (Intercept)         cov 
##    2.187846    2.788411

The coefficients of the model are coherent with the coefficients I specified.

Fitting the point process model in stan

My code for the point process model in stan is as follow:

ppm_stan <- '
data{
  int<lower = 1> n;
  vector[n] x;
  int<lower = 0> y[n];
}
parameters{
  real beta0;
  real beta1;
}
transformed parameters{
}
model{
  //priors
  target += normal_lpdf(beta0 | 0,5);
  target += normal_lpdf(beta1 | 0,10);
// likelihood
  target += poisson_log_lpmf(y | beta0 + beta1 * x);
}
generated quantities{
  vector[n] lambda_rep;
  lambda_rep = exp(beta0 + beta1 * x);
}'

We next fit this model:

stan_data = list(n = length(pp$Lambda), 
x = as.vector(t(pp$gridcov)), y = pp$Lambda)
fit_stan <- stan(model_code = ppm_stan, data = stan_data, 
                 warmup = 500, iter = 2000, chains = 3)

And check if the coefficients are coherent with the ones I specified:

print(fit_stan, pars = c('beta0', 'beta1'))
## Inference for Stan model: 200e1419a3bfac13c3097743f3003142.
## 3 chains, each with iter=2000; warmup=500; thin=1; 
## post-warmup draws per chain=1500, total post-warmup draws=4500.
## 
##       mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## beta0 2.12       0 0.02 2.08 2.11 2.12 2.14  2.17  1003    1
## beta1 2.97       0 0.03 2.91 2.95 2.97 2.99  3.02  1029    1
## 
## Samples were drawn using NUTS(diag_e) at Wed Dec 02 10:20:18 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

This is also coherent. We notice that the coefficients Stan return are a bit higher than what ppm() gave us.

Comparing spatstat and rstan output

A first thing we can do is to check if the predictions seem correct. This can help us develop our intuition about which software performs better and help us double check if the model fit correctly.

Making the prediction for the ppm() object is simpler than with the stan object, but it is still relatively straightforward:

# spatstat predictions
pred <- predict(fit)
# Stan predictions
lambda_rep <- as.data.frame(rstan::extract(fit_stan)['lambda_rep'])
mean_lambda_rep <- apply(lambda_rep, 2, 'mean')

We then create a grid in which we will gather all the predictions

pointp <- pp$pp
pp_sp <- as.SpatialPoints.ppp(pointp)
pp_sf <- st_as_sf(pp_sp)
grid <- st_make_grid(pp_sf, n = dim, what = 'polygons') %>% st_as_sf()
grid$pred_stan <- mean_lambda_rep
grid$pred_spatstat <- as.vector(t(pred$v))
grid$intensity <- pp$Lambda

Plot the predictions

plot(grid)

Somehow, spatstat do not predict in some cells.

We can also plot the coefficients, this will help us vizualise the error associated to the parameter values.

Graphically we can see that stan performs better than the ppm() function. We can formalize this intuition by computing the sum of the Root Mean squared error.

We define our helper function rmse:

rmse <- function(true, observed){sqrt((true - observed)^2)}

And we calculate. First note that we get rid of the NAs values as spatstat did not predict in certain grid cells.

grid_no_na <- grid %>% filter(!is.na(pred_spatstat))
sum(rmse(grid_no_na$intensity, grid_no_na$pred_stan))
## [1] 1993.056
sum(rmse(grid_no_na$intensity, grid_no_na$pred_spatstat))
## [1] 2551.068

The sum RMSE for the point process model fitted in stan is inferior to the sum rmse fitted with spatstat which make us conclude that in this case stan performed better.

In conclusion

In this article I have introduced the concepts of point patterns and point process models. PPMs are a very powerful class of models when dealing with point patterns, frequently encountered in spatial data analysis. As I mentioned it is important to understand the model you are fitting and while Stan may not be optimal for PPMs (INLA is way more efficient) coding your model from scratch will give you a better grasp of what your model is doing and how to interpret its results.

In Part 2 we will take the point processes case further and we will see how to simulate and fit a Cox process – which is an inhomogenous Poisson process with a random intensity function. We will also take the case of the Log gaussian Cox process in which the log-intensity of the Cox process is a Gaussian process!


Related Articles