Spatial Regression Using Fabricated Data

Another dimension of modelling in R

Dr. Marc Jacobs
Towards Data Science
10 min readFeb 27, 2022

--

There are numerous posts already available on spatial regression modelling, dealing with geodata, and making fancy plots of data based on location coordinates. For me, I had no intention to post on spatial regression and was actually busy exploring ways to easily simulate hierarchical data using the fabricatr package. That brought me to an example containing spatial data and further exploring ways of dealing with data containing a spatial covariance matrix. And so, here I am, writing about it.

It’s not a comprehensive post, but it does show some of the key elements of spatial regression. And why fancy techniques do nothing if the data does not really contain anything. Let’s see and explore!

library(fabricatr)
library(lattice)
library(sf)
library(mapview)
library(data.table)
library(tidyverse)
library(sp)
library(gstat)
library(geoR)
library(viridis)
library(ggExtra)
library(grid)
library(gridExtra)
library(GGally)
library(MASS)
library(mgcv)
library(tidymv)
library(sjPlot)
library(lme4)
library(wakefield)
library(simsurv)
library(forecast)
simulated_quake_data <- fabricate(
data = quakes,
fatalities = round(pmax(0, rnorm(N, mean = mag)) * 100),
insurance_cost = fatalities * runif(N, 1000000, 2000000))
head(simulated_quake_data)
simulated_quake_data<-simulated_quake_data%>%
group_by(lat, long)%>%
mutate(mag=mean(mag),
fatalities=mean(fatalities),
insurance_cost=mean(insurance_cost))%>%
distinct(., lat,long, .keep_all= TRUE)
table(duplicated(simulated_quake_data[,c(1,2)]))

So, the the dataset that brought me where I am now is the quake dataset which you can augment using the fabricatr package. It looks like this:

Now, one of the things no spatial model likes is duplicate location data so you need to get rid of it.

And then it is off to some fancy graphics showing where the data is actually coming from. It shows clearly the earthquakes in the Fiji region. Too bad the data does not have a temporal component. That would make it even more interesting to model.

mapview(simulated_quake_data, 
xcol = "long",
ycol = "lat",
crs = 4269,
grid = FALSE)
Nice overlaid data. Image by author.
sbux_sf <- st_as_sf(simulated_quake_data, 
coords = c("long", "lat"),
crs = 4326)
plot(st_geometry(sbux_sf))
Same graph, but no overlay. Image by author.
sqbin <- ggplot() +
geom_bin2d(
data=as.data.frame(st_coordinates(sbux_sf)),
aes(x=X, y=X))+theme_bw()
hexbin <- ggplot() +
geom_hex(
data=as.data.frame(st_coordinates(sbux_sf)),
aes(x=X, y=Y)) +
scale_fill_continuous(type = "viridis")+theme_bw()
grid.arrange(sqbin, hexbin, ncol=2)
ggplot(sbux_sf, aes(x=mag))+
geom_histogram(bins=50, aes(y=..density..))+
geom_density(fill="#FF6666", alpha=0.5, colour="#FF6666")+
theme_bw()
And the data binned, using hexagons to see where the majority of quakes are happening. Image by author.
The density of the magnitudes. Image by author.
ggplot(data = sbux_sf) +
stat_density2d_filled(
data = as.data.frame(st_coordinates(sbux_sf)),
aes(x = X, y = Y, alpha = ..level..),
n = 16) +
scale_color_viridis_c() +
theme_bw() +
labs(x="Longitude", y="Latitude")+
geom_sf(alpha=0)+
theme(legend.position = "none")
And another way to plot the data based on latitude and longitude. Something went amiss in the creation of the graph it seems, because I am not getting all the data on the axes. Image by author.

And here, another way to connect the dots on the data and use latitude and longitude coordinates to build geometric points and model, predict, and plot plot the data. The model used applies inverse distance weighting although numerous other methods can be used, like Kriging for instance.

sd.grid <- sbux_sf %>%
st_bbox() %>%
st_as_sfc() %>%
st_make_grid(
n = 100,
what = "centers"
) %>%
st_as_sf() %>%
cbind(., st_coordinates(.))
idw.hp <- idw(
mag ~ 1,
locations = sbux_sf,
newdata=sd.grid,
nmax = 150)
idw.hp = idw.hp %>%
cbind(st_coordinates(.))
g1<-ggplot(idw.hp, aes(x = X, y = Y, fill = var1.pred)) +
geom_raster() +theme_bw()+theme(legend.position = "none")
g2<-ggplot(idw.hp, aes(x = X, y = Y, fill = var1.pred)) +
geom_raster() +
scale_fill_viridis_b() +
theme_void() +
geom_sf(alpha=0)+theme(legend.position = "none")
grid.arrange(g1,g2)
Nice plot — you can make the contours of the islands of Fiji. Image by author.
Same here. These plots are just reproductions of the original plot. Image by author.
mapview(sbux_sf, map.types = "Stamen.Toner") 
coordinates(simulated_quake_data) <- c("long", "lat")
proj4string(simulated_quake_data) <- CRS("+proj=longlat +datum=WGS84")
lzn.vgm <- variogram(log(mag) ~ lat + long, simulated_quake_data, width=0.1)
plot(lzn.vgm)
lzn.fit = fit.variogram(lzn.vgm,
vgm(c("Gau", "Sph", "Mat", "Exp")),
fit.kappa = TRUE,
fit.method= 1)
plot(lzn.fit, 1200)
And another one. Image by author.

I think we have now had our fair share of maps and views of the data at hand. Time to move into variograms. For those who do not know what a variogram is — it’s a way to combine the distances between locations and their (co-variance). A spatial co-variance matrix. The trick is to build a plot that shows you at which distance the variances between points no longer matter. In other words, when they are no longer connected. Just like there are different types of non-spatial covariance matrixes, there are different types of spatial co-variance matrices. In fact, the term spatial does not really mean anything since any covariance matrix is determined to connect variance and covariance between adjacent and further-away points. Spatial here just means that you will include location data. Nothing more.

Looks like a straight line to me — with a whole lot of noise. This plot does not help — at all. Image by author.
This plot is the graphical depiction of the outcome of the lzn.fit model above. Image by author.

Above, you see two models, the Nug and the Sph which means Nugget and Spherical. The Spherical part is just the way the co-variance matrix is build up, like autoregressive, toeplitz of unstructured. Later on, I will show you different spatial covariance matrices.

The nugget is the most interesting. Nugget means ‘starting point’ — so where do you want your spatial covariance matrix to start, in terms of height. Such a matrix does not have to include a nugget.

Let’s start exploring that kind of data. Many packages exist, but I will end up with my favorite which is the nlme package. Many other packages call upon that package to extend it.

simulated_quake_data <- fabricate(
data = quakes,
fatalities = round(pmax(0, rnorm(N, mean = mag)) * 100),
insurance_cost = fatalities * runif(N, 1000000, 2000000))
simulated_quake_data<-simulated_quake_data%>%
group_by(lat, long)%>%
mutate(mag=mean(mag),
fatalities=mean(fatalities),
insurance_cost=mean(insurance_cost))%>%
distinct(., lat,long, .keep_all= TRUE)
EC97 <- as.geodata(simulated_quake_data,
coords.col = 1:2,
data.col = 4)
Var1.EC97 <- variog(EC97, trend = "1st")
Var2.EC97 <- variog(EC97, trend = "2nd")
plot.new()
par(mfrow = c(2, 1))
plot(Var1.EC97, pch = 19, col = "blue", main = "1st order variogram")
plot(Var2.EC97, pch = 19, col = "red", main = "2nd order variogram")
plot.new()
par(mfrow = c(1, 1))
plot(Var1.EC97, pch = 19, col = "blue", main = "Variogram Simulated Quake Data")
par(new = TRUE)
plot(Var2.EC97, pch = 19, col = "red", xaxt = "n", yaxt = "n")
ini.vals <- expand.grid(seq(0, 30, by = 1), seq(2,30, by = 1))
ols <- variofit(Var1.EC97, ini = ini.vals, fix.nug = TRUE, wei = "equal")
summary(ols)
wls <- variofit(Var1.EC97, ini = ini.vals, fix.nug = TRUE, wei = "npairs")
lines(wls)
lines(ols, lty = 2, col = "blue")
Botha first and second order variogram show the same. There really is not a lot of spatial dependence when trying to connect locations to the magnitude of the earthquakes. I have tom say, I do find that quite intriguing. but I do not have enough knowledge to even try and understand why that is. From a geological point of view. Image by author.

Let’s try and model the data differently by focusing on fatalities and not magnitude. Here, you see my building a simple linear model on coordinate data. The variogram follows naturally.

Model1=lm(fatalities~depth+mag,
data=sbux_sf)
sbux_sf$residuals=residuals(Model1)
sbux_sf<-sbux_sf%>%cbind(., st_coordinates(.))
Vario_res = variogram(residuals~X+Y,
data=sbux_sf,
cutoff=100,
alpha=c(0,45,90,135))
plot(Vario_res)
Despite angle changes it seems to consist of a flat line. Do not let the variance fool you. Image by author.

Let’s try again, although I am pretty sure the results will be the same. We can build a general linear model coming from the nlme package and ask it to build a spherical covariance matrix.

modSpher = gls(fatalities~depth+mag,
data=sbux_sf,
correlation=corSpher(c(100),
form=~X+Y,
nugget=F))
VarioSpher_raw = Variogram(modSpher,
form =~ X+Y,
robust = TRUE,
maxDist = 100,
resType = "pearson")
VarioSpher_normalized = Variogram(modSpher,
form =~X+Y,
robust = TRUE,
maxDist = 100,
resType = "normalized")
plot(VarioSpher_raw)
plot(VarioSpher_normalized)
I guess the model, especially the studentized residuals variogram, see more then a straight line. I always find that tricky, but it shows that by increasing the distance, you decrease the co-variance. That makes sense. Image by author.

Here, another example to guild a glm model, without spatial covariance matrix, and save its output to build a bubble plot of residuals plotted on the coordinates of the original data. It’s actually quite a handy way to plot spatial data when you get the feeling that modeling the data is not really working out for you.

YF.glm = glm(fatalities ~ depth+mag, 
family = gaussian,
data = sbux_sf)
temp_data = data.frame(error = rstandard(YF.glm),
x = sbux_sf$X,
y = sbux_sf$Y)
coordinates(temp_data) <- c("x","y")
bubble(temp_data, "error", col = c("black","grey"),
main = "Residuals", xlab = "X-coordinates", ylab = "Y-coordinates")
plot(temp_data$error ~ temp_data$x, xlab = "X-coordinates", ylab = "Errors")
plot(temp_data$error ~ temp_data$y, xlab = "Y-coordinates", ylab = "Errors")
plot(variogram(error ~ 1, temp_data))
plot(variogram(error ~ 1, temp_data, alpha = c(0, 45, 90, 135)))
Then again, if I look at the plot, I find it hard to decipher patterns. Surely, the Fiji island can be recognized, but I do NOT see a clear pattern. A nice way to plot the data but not really informative here. Image by author.
Here a way to plot the error, seperately, on x and y coordinates. The plot above has a more complete picture. Image by author.
And the variogram — in total or at different angles. The 90-degree angle shows something interesting, with a sudden slope down. Image by author.

The angle adds different level of information, since the spatial co-variance might behave differently dependent on the route. You can clearly see that from any of the previous maps made, as well as the bubble plot. Although difficult to decipher, it is easy to see that in a spatial environment, the angle (and thus direction) your are looking at plays a role. Like looking at a map.

And let’s incorporate a spherical correlation matrix in the model.

f1 = fatalities ~ depth+mag
model1 <- gls(f1,
correlation = corSpher(form =~ X + Y, nugget = TRUE),
data = sbux_sf)
plot(model1)
plot(Variogram(model.1))
Looking good. But the variogram could not be plotted. That is not really a good sign, and perhaps I should have asked it to be plotted at different angles. Image by author.

Here, a linear mixed model, including the number of stations used as a explained variance component. I have to admit that is this dataset, none of the variables looked like true variance component that were included to be later filtered out in a mixed model. So I used the single variable that consisted of integers and that had repeated values at the same levels of the number of stations included. So, in the model it works, but biologically I see no merit for including it. Another way how a technical example has little biological worth. At least, to me.

Image by author.
model.1<-glmmPQL(f1, random=~1|stations,
data=sbux_sf,
correlation=corExp(form=~X+Y, nugget = T),
family=gaussian)
plot(Variogram(model.1),
main = "Exponential Correlation")
model.2<-glmmPQL(f1, random=~1|stations,
data=sbux_sf,
correlation=corGaus(form=~X+Y, nugget = T),
family=gaussian)
plot(Variogram(model.2),
main = "Gaussian Correlation")
model.3<-glmmPQL(f1, random=~1|stations,
data=sbux_sf,
correlation=corSpher(form=~X+Y, nugget = T),
family=gaussian)
plot(Variogram(model.3),
main = "Spherical Correlation")
model.4<-glmmPQL(f1, random=~1|stations,
data=sbux_sf,
correlation=corRatio(form=~X+Y, nugget = T),
family=gaussian)
plot(Variogram(model.4),
main = "Rational Quadratic Correlation")
Not really interesting, no matter what you use for a covariance structure. The models becomes increasingly unstable however, which happens if your included so many extra parameters to estimate without any added benefit. Image by author.
f2 = fatalities ~ s(depth)+s(mag)
model2.base <- gamm(f2, method = "REML",
data=sbux_sf,
family=gaussian)
model2.1 <- gamm(f2,
method = "REML",
data=sbux_sf,
correlation=corExp(form=~X+Y, nugget = T),
family=gaussian)
model2.1$lme
model2.1$gam
par(mfrow = c(1, 2))
plot(model2.1$gam)
plot(model2.1$lme)
predict_gam(model2.1$gam)%>%
ggplot(., aes(x=depth,
y=mag,
fill=fit)) +
geom_tile()+
scale_fill_viridis_c(option="magma")+
theme_bw()
A GAMM models reveals a deep relationship with magnitude and fatalities, but not with depth and fatalities, nor with depth and magnitude (which we did not really model but still). The right plot clearly shows the main effect of magnitude. Image by author.
f3 = fatalities ~ s(depth) + s(mag) + ti(depth,mag)
model3.base <- gamm(f3,
method = "REML",
data=sbux_sf,
family=gaussian)
par(mfrow = c(1, 3))
plot(model3.base$gam)
model3.rand <- gamm(f3,
random = list(stations=~1),
method = "REML",
data=sbux_sf,
family=gaussian)
model3.1 <- gamm(f3,
method = "REML",
data=sbux_sf,
random = list(stations=~1),
correlation=corExp(form=~X+Y, nugget = T),
family=gaussian)
predict_gam(model3.1$gam)%>%
ggplot(., aes(x=depth,
y=mag,
fill=fit)) +
geom_tile()+
scale_fill_viridis_c(option="magma")+
theme_bw()
predict_gam(model3.1$gam) %>%
ggplot(aes(depth, fit)) +
geom_smooth_ci()+theme_bw()
predict_gam(model3.1$gam)%>%
ggplot(aes(depth, mag, z = fit)) +
geom_raster(aes(fill = fit)) +
geom_contour(colour = "white") +
scale_fill_continuous(name = "y") +
theme_minimal() +
theme(legend.position = "top")
Same as the single-order GAMM, but we also included an interaction. Image by author.
The interaction we included is of the sort — need to show something but it does not make a lot of sense. Image by author.
And here’s another great way of showing that adding something is actually adding nothing. Image by author.

And with that last sentence, this short introduction comes to a close. Spatial Regression is ‘spatial’ because it tries to use known distance measures like longitude and latitude, although different variables could be included of course. For instance, data coming from clustering technique (which also use metrics to assess ‘distance’) could be included to form a spatial covariance matrix. Longitudinal data consisting of time-points also use distances, but here, the connection is much more flat and so added parameters like angle do not play.

The fact that you have spatial data does not mean that modeling it, or including it in your model, will make your life easier. In the data seen above, the quakes do not depict that much variance as can be seen in the pictures. It is a ‘flat’ picture with little geo-spatial influence. That is why the variogram is also flat — it does not really matter that much were you are in Fiji, besides, perhaps, a tiny spot. That’s it.

With these kinds of examples, you can always try and go for one that perfectly shows what the technique adds, but I did not change that which I could not foresee. The reason being that data analysis always look very nice in most examples, but rarely is THAT exciting.

I hope you enjoyed it. Let me know if something is amiss!

--

--

Scientist. Builder of models, and enthousiast of statistics, research, epidemiology, probability, and simulations for 10+ years.