Spatial Regression Using Fabricated Data
Another dimension of modelling in R
There are numerous posts already available on spatial regression modelling, dealing with geodata, and making fancy plots of data based on location coordinates. For me, I had no intention to post on spatial regression and was actually busy exploring ways to easily simulate hierarchical data using the fabricatr package. That brought me to an example containing spatial data and further exploring ways of dealing with data containing a spatial covariance matrix. And so, here I am, writing about it.
It’s not a comprehensive post, but it does show some of the key elements of spatial regression. And why fancy techniques do nothing if the data does not really contain anything. Let’s see and explore!
library(fabricatr)
library(lattice)
library(sf)
library(mapview)
library(data.table)
library(tidyverse)
library(sp)
library(gstat)
library(geoR)
library(viridis)
library(ggExtra)
library(grid)
library(gridExtra)
library(GGally)
library(MASS)
library(mgcv)
library(tidymv)
library(sjPlot)
library(lme4)
library(wakefield)
library(simsurv)
library(forecast)simulated_quake_data <- fabricate(
data = quakes,
fatalities = round(pmax(0, rnorm(N, mean = mag)) * 100),
insurance_cost = fatalities * runif(N, 1000000, 2000000))
head(simulated_quake_data)
simulated_quake_data<-simulated_quake_data%>%
group_by(lat, long)%>%
mutate(mag=mean(mag),
fatalities=mean(fatalities),
insurance_cost=mean(insurance_cost))%>%
distinct(., lat,long, .keep_all= TRUE)
table(duplicated(simulated_quake_data[,c(1,2)]))
So, the the dataset that brought me where I am now is the quake dataset which you can augment using the fabricatr package. It looks like this:
Now, one of the things no spatial model likes is duplicate location data so you need to get rid of it.
And then it is off to some fancy graphics showing where the data is actually coming from. It shows clearly the earthquakes in the Fiji region. Too bad the data does not have a temporal component. That would make it even more interesting to model.
mapview(simulated_quake_data,
xcol = "long",
ycol = "lat",
crs = 4269,
grid = FALSE)
sbux_sf <- st_as_sf(simulated_quake_data,
coords = c("long", "lat"),
crs = 4326)
plot(st_geometry(sbux_sf))
sqbin <- ggplot() +
geom_bin2d(
data=as.data.frame(st_coordinates(sbux_sf)),
aes(x=X, y=X))+theme_bw()
hexbin <- ggplot() +
geom_hex(
data=as.data.frame(st_coordinates(sbux_sf)),
aes(x=X, y=Y)) +
scale_fill_continuous(type = "viridis")+theme_bw()
grid.arrange(sqbin, hexbin, ncol=2)
ggplot(sbux_sf, aes(x=mag))+
geom_histogram(bins=50, aes(y=..density..))+
geom_density(fill="#FF6666", alpha=0.5, colour="#FF6666")+
theme_bw()
ggplot(data = sbux_sf) +
stat_density2d_filled(
data = as.data.frame(st_coordinates(sbux_sf)),
aes(x = X, y = Y, alpha = ..level..),
n = 16) +
scale_color_viridis_c() +
theme_bw() +
labs(x="Longitude", y="Latitude")+
geom_sf(alpha=0)+
theme(legend.position = "none")
And here, another way to connect the dots on the data and use latitude and longitude coordinates to build geometric points and model, predict, and plot plot the data. The model used applies inverse distance weighting although numerous other methods can be used, like Kriging for instance.
sd.grid <- sbux_sf %>%
st_bbox() %>%
st_as_sfc() %>%
st_make_grid(
n = 100,
what = "centers"
) %>%
st_as_sf() %>%
cbind(., st_coordinates(.))
idw.hp <- idw(
mag ~ 1,
locations = sbux_sf,
newdata=sd.grid,
nmax = 150)
idw.hp = idw.hp %>%
cbind(st_coordinates(.))
g1<-ggplot(idw.hp, aes(x = X, y = Y, fill = var1.pred)) +
geom_raster() +theme_bw()+theme(legend.position = "none")
g2<-ggplot(idw.hp, aes(x = X, y = Y, fill = var1.pred)) +
geom_raster() +
scale_fill_viridis_b() +
theme_void() +
geom_sf(alpha=0)+theme(legend.position = "none")
grid.arrange(g1,g2)
mapview(sbux_sf, map.types = "Stamen.Toner")
coordinates(simulated_quake_data) <- c("long", "lat")
proj4string(simulated_quake_data) <- CRS("+proj=longlat +datum=WGS84")
lzn.vgm <- variogram(log(mag) ~ lat + long, simulated_quake_data, width=0.1)
plot(lzn.vgm)
lzn.fit = fit.variogram(lzn.vgm,
vgm(c("Gau", "Sph", "Mat", "Exp")),
fit.kappa = TRUE,
fit.method= 1)
plot(lzn.fit, 1200)
I think we have now had our fair share of maps and views of the data at hand. Time to move into variograms. For those who do not know what a variogram is — it’s a way to combine the distances between locations and their (co-variance). A spatial co-variance matrix. The trick is to build a plot that shows you at which distance the variances between points no longer matter. In other words, when they are no longer connected. Just like there are different types of non-spatial covariance matrixes, there are different types of spatial co-variance matrices. In fact, the term spatial does not really mean anything since any covariance matrix is determined to connect variance and covariance between adjacent and further-away points. Spatial here just means that you will include location data. Nothing more.
Above, you see two models, the Nug and the Sph which means Nugget and Spherical. The Spherical part is just the way the co-variance matrix is build up, like autoregressive, toeplitz of unstructured. Later on, I will show you different spatial covariance matrices.
The nugget is the most interesting. Nugget means ‘starting point’ — so where do you want your spatial covariance matrix to start, in terms of height. Such a matrix does not have to include a nugget.
Let’s start exploring that kind of data. Many packages exist, but I will end up with my favorite which is the nlme package. Many other packages call upon that package to extend it.
simulated_quake_data <- fabricate(
data = quakes,
fatalities = round(pmax(0, rnorm(N, mean = mag)) * 100),
insurance_cost = fatalities * runif(N, 1000000, 2000000))
simulated_quake_data<-simulated_quake_data%>%
group_by(lat, long)%>%
mutate(mag=mean(mag),
fatalities=mean(fatalities),
insurance_cost=mean(insurance_cost))%>%
distinct(., lat,long, .keep_all= TRUE)
EC97 <- as.geodata(simulated_quake_data,
coords.col = 1:2,
data.col = 4)
Var1.EC97 <- variog(EC97, trend = "1st")
Var2.EC97 <- variog(EC97, trend = "2nd")
plot.new()
par(mfrow = c(2, 1))
plot(Var1.EC97, pch = 19, col = "blue", main = "1st order variogram")
plot(Var2.EC97, pch = 19, col = "red", main = "2nd order variogram")
plot.new()
par(mfrow = c(1, 1))
plot(Var1.EC97, pch = 19, col = "blue", main = "Variogram Simulated Quake Data")
par(new = TRUE)
plot(Var2.EC97, pch = 19, col = "red", xaxt = "n", yaxt = "n")
ini.vals <- expand.grid(seq(0, 30, by = 1), seq(2,30, by = 1))
ols <- variofit(Var1.EC97, ini = ini.vals, fix.nug = TRUE, wei = "equal")
summary(ols)
wls <- variofit(Var1.EC97, ini = ini.vals, fix.nug = TRUE, wei = "npairs")
lines(wls)
lines(ols, lty = 2, col = "blue")
Let’s try and model the data differently by focusing on fatalities and not magnitude. Here, you see my building a simple linear model on coordinate data. The variogram follows naturally.
Model1=lm(fatalities~depth+mag,
data=sbux_sf)
sbux_sf$residuals=residuals(Model1)
sbux_sf<-sbux_sf%>%cbind(., st_coordinates(.))
Vario_res = variogram(residuals~X+Y,
data=sbux_sf,
cutoff=100,
alpha=c(0,45,90,135))
plot(Vario_res)
Let’s try again, although I am pretty sure the results will be the same. We can build a general linear model coming from the nlme package and ask it to build a spherical covariance matrix.
modSpher = gls(fatalities~depth+mag,
data=sbux_sf,
correlation=corSpher(c(100),
form=~X+Y,
nugget=F))
VarioSpher_raw = Variogram(modSpher,
form =~ X+Y,
robust = TRUE,
maxDist = 100,
resType = "pearson")
VarioSpher_normalized = Variogram(modSpher,
form =~X+Y,
robust = TRUE,
maxDist = 100,
resType = "normalized")
plot(VarioSpher_raw)
plot(VarioSpher_normalized)
Here, another example to guild a glm model, without spatial covariance matrix, and save its output to build a bubble plot of residuals plotted on the coordinates of the original data. It’s actually quite a handy way to plot spatial data when you get the feeling that modeling the data is not really working out for you.
YF.glm = glm(fatalities ~ depth+mag,
family = gaussian,
data = sbux_sf)
temp_data = data.frame(error = rstandard(YF.glm),
x = sbux_sf$X,
y = sbux_sf$Y)
coordinates(temp_data) <- c("x","y")
bubble(temp_data, "error", col = c("black","grey"),
main = "Residuals", xlab = "X-coordinates", ylab = "Y-coordinates")
plot(temp_data$error ~ temp_data$x, xlab = "X-coordinates", ylab = "Errors")
plot(temp_data$error ~ temp_data$y, xlab = "Y-coordinates", ylab = "Errors")
plot(variogram(error ~ 1, temp_data))
plot(variogram(error ~ 1, temp_data, alpha = c(0, 45, 90, 135)))
The angle adds different level of information, since the spatial co-variance might behave differently dependent on the route. You can clearly see that from any of the previous maps made, as well as the bubble plot. Although difficult to decipher, it is easy to see that in a spatial environment, the angle (and thus direction) your are looking at plays a role. Like looking at a map.
And let’s incorporate a spherical correlation matrix in the model.
f1 = fatalities ~ depth+mag
model1 <- gls(f1,
correlation = corSpher(form =~ X + Y, nugget = TRUE),
data = sbux_sf)
plot(model1)
plot(Variogram(model.1))
Here, a linear mixed model, including the number of stations used as a explained variance component. I have to admit that is this dataset, none of the variables looked like true variance component that were included to be later filtered out in a mixed model. So I used the single variable that consisted of integers and that had repeated values at the same levels of the number of stations included. So, in the model it works, but biologically I see no merit for including it. Another way how a technical example has little biological worth. At least, to me.
model.1<-glmmPQL(f1, random=~1|stations,
data=sbux_sf,
correlation=corExp(form=~X+Y, nugget = T),
family=gaussian)
plot(Variogram(model.1),
main = "Exponential Correlation")
model.2<-glmmPQL(f1, random=~1|stations,
data=sbux_sf,
correlation=corGaus(form=~X+Y, nugget = T),
family=gaussian)
plot(Variogram(model.2),
main = "Gaussian Correlation")
model.3<-glmmPQL(f1, random=~1|stations,
data=sbux_sf,
correlation=corSpher(form=~X+Y, nugget = T),
family=gaussian)
plot(Variogram(model.3),
main = "Spherical Correlation")
model.4<-glmmPQL(f1, random=~1|stations,
data=sbux_sf,
correlation=corRatio(form=~X+Y, nugget = T),
family=gaussian)
plot(Variogram(model.4),
main = "Rational Quadratic Correlation")
f2 = fatalities ~ s(depth)+s(mag)
model2.base <- gamm(f2, method = "REML",
data=sbux_sf,
family=gaussian)
model2.1 <- gamm(f2,
method = "REML",
data=sbux_sf,
correlation=corExp(form=~X+Y, nugget = T),
family=gaussian)
model2.1$lme
model2.1$gam
par(mfrow = c(1, 2))
plot(model2.1$gam)
plot(model2.1$lme)
predict_gam(model2.1$gam)%>%
ggplot(., aes(x=depth,
y=mag,
fill=fit)) +
geom_tile()+
scale_fill_viridis_c(option="magma")+
theme_bw()
f3 = fatalities ~ s(depth) + s(mag) + ti(depth,mag)
model3.base <- gamm(f3,
method = "REML",
data=sbux_sf,
family=gaussian)
par(mfrow = c(1, 3))
plot(model3.base$gam)
model3.rand <- gamm(f3,
random = list(stations=~1),
method = "REML",
data=sbux_sf,
family=gaussian)
model3.1 <- gamm(f3,
method = "REML",
data=sbux_sf,
random = list(stations=~1),
correlation=corExp(form=~X+Y, nugget = T),
family=gaussian)
predict_gam(model3.1$gam)%>%
ggplot(., aes(x=depth,
y=mag,
fill=fit)) +
geom_tile()+
scale_fill_viridis_c(option="magma")+
theme_bw()predict_gam(model3.1$gam) %>%
ggplot(aes(depth, fit)) +
geom_smooth_ci()+theme_bw()predict_gam(model3.1$gam)%>%
ggplot(aes(depth, mag, z = fit)) +
geom_raster(aes(fill = fit)) +
geom_contour(colour = "white") +
scale_fill_continuous(name = "y") +
theme_minimal() +
theme(legend.position = "top")
And with that last sentence, this short introduction comes to a close. Spatial Regression is ‘spatial’ because it tries to use known distance measures like longitude and latitude, although different variables could be included of course. For instance, data coming from clustering technique (which also use metrics to assess ‘distance’) could be included to form a spatial covariance matrix. Longitudinal data consisting of time-points also use distances, but here, the connection is much more flat and so added parameters like angle do not play.
The fact that you have spatial data does not mean that modeling it, or including it in your model, will make your life easier. In the data seen above, the quakes do not depict that much variance as can be seen in the pictures. It is a ‘flat’ picture with little geo-spatial influence. That is why the variogram is also flat — it does not really matter that much were you are in Fiji, besides, perhaps, a tiny spot. That’s it.
With these kinds of examples, you can always try and go for one that perfectly shows what the technique adds, but I did not change that which I could not foresee. The reason being that data analysis always look very nice in most examples, but rarely is THAT exciting.
I hope you enjoyed it. Let me know if something is amiss!