The world’s leading publication for data science, AI, and ML professionals.

Spatial Data Science: Spatial Queries

answering every "where" question with SQL; with example

thumbnail, by Rini & Author (2022)
thumbnail, by Rini & Author (2022)

Introduction

We usually do suitability analysis through geo-processes such as buffer, union, and intersects in ArcGIS and Qgis process-by-process. Click on one tool button, create an output and use it as an input for other processes. This results in many scratch files in your cache folder resulting in a wonder of remembering which files to use. For simple suitability analysis like this, we technically can utilise model builder both in ArcGIS and Qgis, but a more elegant method is just articulating our suitability request in a more computer parse-able manner.

Suitability analysis is just a set of requirements that we impose on sets of spatial data, for instance:

I want to search for land to be developed as a farm, so we have this sets of instructions:

  • get an intersection between the un-protected area and arable area
  • from the intersection, query all existing empty-land land use
  • from those identified empty land, filter out those that are not within 1 km of irrigation to ensure water for the farm

There is a way to translate these instructions to query from all of our data sets with a single "enter" click.

SQL – Structured Query Language

For this article context, Structured Query Language (SQL) is a language to articulate our query data parameter into instructions that the database manager understands. We can do more, but for now, this explanation is enough. It is an instruction; a request to the database for it to return us the data that we want. I believe everyone knows what a database is already.

How does SQL look? It looks something like this:

SELECT * FROM land_use

this means "select all columns from land_use shapefile". not shapefile per se, rather, from land_usetable (that contains geographical information). Another example:

SELECT name, length, geometry FROM irrigation WHERE length < 2000

this means "return irrigation line where its length is less than 2000, but only with column Name, Length, and its geometry". This omits other columns besides name, length, and geometry because perhaps the irrigation table contains many unnecessary columns such as its type, its developer, and many more. Returning too many columns results in unnecessary bandwidth and spending a lot of computing resources.

example of SQL in spatial table (source: author, 2022)
example of SQL in spatial table (source: author, 2022)

The Idea

The idea is to formulate query instruction such that it returns the spatial data that satisfy our query. This means that we literally type in the instruction; enabling the future analyst to audit our process and reproduce it. It also helps us to track what we have been doing. Look at some of the examples of spatial queries from PostGIS documentation.

Some examples of spatial queries (start with where):

  • where should I live?
  • where should I open my new store?
  • where are the residential zones that are next to industrial zones?
  • should my office relocate to the advertised property?
  • Where are the endangered animals?
  • If I build a high-speed rail, where does this impact the biodiversity of the region?

Spatial Query

The spatial query starts when we introduce the spatial query functions in our SQL script. These functions differ between software, but I suppose the most popular one is Postgresql with Postgis extension, or so we call (just) PostGIS. I’ll discuss PostgreSQL at the end of this article. Other software also provides such a capability; for example with Google BigQuery or Microsoft SQL Server.

The following script is the official example of SQL from postgis.

WITH city AS 
(SELECT 'Gotham' AS name,  ST_Buffer(ST_Point(0,0), 10) AS geom),
superhero(name,geom) AS (VALUES
('Bat Boy', ST_Point(0.1,0)), 
('Bat Girl', ST_Point(1,1)))
SELECT superhero.name FROM city INNER JOIN superhero ON ST_Contains(city.geom, superhero.geom);

You can see all of the functions and operators in this official reference documentation. The common tools like buffer, union, and clip are there!

Let’s jump into the action

Case: Identifying Endangered Settlements with Spatial SQL

Suppose there is a city that is prone to natural hazards like flooding, but we don’t know which residential area is in danger. We need to formulate mitigation measures for these citizens, but we don’t know for whom.

We have the land use data, and we have the flood danger zone (red hashed polygon). When we overlay the data, we yield this map.

case data (source: author, 2022)
case data (source: author, 2022)

It’s simple enough for us to comprehend, but in actual life, this might be the size of a region or a province; too large for our brain to comprehend.

Here we can see that there are overlaps between the settlements and the danger zone; how do we acquire that area? In other words, where are the settlements that are prone to flooding?

The Common Way

the common way is to

  • Select by attributes on the land use data where land use equals "settlements" (this is actually SQL by the way)
  • with the selection still active, the open clip/intersect tool
  • set the selected land use as input, and the danger zone as the intersecting feature
  • perform the intersect, and save the result in memory or some directory

Now let’s do it with Spatial SQL

The Fancier Spatial SQL

create the SQL script with Spatial SQL function: ST_Intersection() as the following

select ST_Intersection(danger_zone.geom, landuse_demo.geom) as geom FROM danger_zone , landuse_demo where landuse_demo.landuse = settlements

and the result is as the following layer (answering where are the settlements in the danger zone?).

(source: author, 2022)
(source: author, 2022)

Seems like it’s more complicated, so why bother?

There are several major reasons:

  • speed & performance: getting faster in getting jobs done, load the data that you need not everything!
  • reproducibility & Accountability: notes and being literate on what you do
  • versatility: flexible and interoperable with other apps.
  • Accountable: it’s easier to audit script because it is quite the literal form of processes instead of process graph.

it’s a trade off between complexity and flexibility, but it’s an option if the above points are what you are looking for.

let’s discuss the reasons further

Speed and Performance

Every time I execute the SQL, I produce a map. So, switching criteria to produce maps is going to be very fast.

Now I want to switch the criteria from settlements to industrial, I can just edit the end of the script like this

select ST_Intersection(danger_zone.geom, landuse_demo.geom) as geom FROM danger_zone , landuse_demo where landuse_demo.landuse = pastures

notice that I switch the landuse parameter at the end from settlement to industrial. The result is now like this

(source: author, 2022)
(source: author, 2022)

or switch it to industrial like this.

select ST_Intersection(danger_zone.geom, landuse_demo.geom) as geom FROM danger_zone , landuse_demo where landuse_demo.landuse = industrial
(source: author, 2022)
(source: author, 2022)

This is just a simple example to get hold of the spatial SQL idea. In a development environment where automation is at the essence, this kind of scripting is a technique that improves performance. I did not even use QGIS or ArcGIS to produce the maps that you have been seeing. All of these can be done with just a monitor and a keyboard (Provided that you have the data in good quality in a PostGIS server, but hey, it’s a different story for another article).

Postgis SQL in Geopandas, (source: author, 2022)
Postgis SQL in Geopandas, (source: author, 2022)

Executing SQL produces layers without loading the whole dataset. You don’t need to load that 4 GB shapefile then filter it; just type in what you want to see and the layer is in your canvas. Do you do something wrong? just change the query script.

Imagine doing it the old way, where you have to click on the tools again, and when things got complicated, you get lost in the way. Of course, we can do this with the model builder; which is a graphical representation of these kinds of scripts anyway. Of course, there is nothing wrong with this if this works for you; but oftentimes, I don’t like clicking on buttons once I learn how to script.

We also switch the processing job to the server instead of our machine. This is very useful when the data is complicated and large but we don’t have a very powerful computer; although this means we need a more reliable server. but hey, that’s why we have cloud computing!

Versatility, Reproducibility and Accountability

Spatial SQL thus enables versatility and reproducibility. I can give this script to my colleagues, and they will have the same result when they execute it. Colleagues can literally see what we are doing, and edit the script collaboratively in a git workflow environment. From my personal experience, once you get along with SQL and the scripting environment, you become really fast and more accountable for your analysis. When things go wrong, we can trace back what we did and just read the instructions we executed.

It is versatile as we can copy that SQL script to develop many apps. SQL is a general language; we can pass this on javascript web-gis apps such as leaflet/OpenLayers, or pass it as a view from the server-side. Compare this with vague instructions on where to click buttons or steps in a GUI desktop workflow environment. With the desktop environment, you can only do it in the software, the os, and the version of the software. Switch from ArcGIS to QGIS, then the instruction is going to be different; but not with SQL script, it’s going to be the same!

Loading the Query to GIS Layers

It is possible to execute SQL in QGIS and see the result as a layer. If you are using Qgis 3.2 above, there is a tool to do this. the tool’s name is Execute and Load SQL using PostgreSQL as the database engine.

QGIS's SQL prompt screenshot (source: author, 2022)
QGIS’s SQL prompt screenshot (source: author, 2022)

we can type our SQL instruction and the tool returns what we request.

By the way, we have been discussing SQL with PostgreSQL and postgis but what is PostgreSQL?

PostgreSQL and Postgis – RDBMS

PostgreSQL is a database manager. See it like the ArcCatalog in ArcGIS or Browser in Qgis. It is intended to be used for general database purposes but with its extension, Postgis, we can take spatial data capabilities from the engine. If we develop a web-gis app, the common practice is to employ PostgreSQL as the database engine that stores the data. For GIS purposes, people usually just say Postgis instead of PostgreSQL even though PostGIS is not really the engine. It’s just more catchy and elegant!

"do you save your spatial data on shapefile or Geodatabase?"

"no, I store it in my PostGIS server" (:

Managing Spatial data using PgAdmin for PostgreSQL (source: author, 2022)
Managing Spatial data using PgAdmin for PostgreSQL (source: author, 2022)

Conclusion

SQL is how we structure our query such that the computers understand what we ask so that it gives the answer. SQL itself is not exclusive to GIS. It is a general language used in the Data Science community, while spatial data just happens to be ordinary data that contains geographical information like coordinates and elevation. Postgis allows us to account for the interaction between spatial data in our SQL.

However, appreciating spatial interactivity requires one extra step in general data science that is, as I mentioned, usually spatially agnostic. This interactivity can be queried using spatial SQL with Postgis ensuring reproducibility and analysis assurance. It is one way to model spatial data and technique in working with spatial data.


Related Articles