
Getting started with Geographic Data Science in Python – Part 2

Tutorials, Real World projects & Exercises

Photo by Milada Vigerova on Unsplash

This is the second article in a three-part series on getting started with Geographic Data Science in Python. You will learn about reading, manipulating and analysing geographic data in Python. The articles are designed to be read in sequence: the first lays the foundation, the second covers intermediate and advanced Geographic Data Science topics, and the third wraps up with a real-world project to cement your learning.

The first article can be accessed here.

Master Geographic Data Science with Real World projects & Exercises

Learning Objectives for this tutorial are:

  1. Understand GeoDataFrames and GeoSeries
  2. Perform table joins and spatial joins
  3. Carry out buffer and overlay analysis

1. GeoDataFrame & GeoSeries

Let us read the countries and cities datasets.

Once the data is loaded, we get a table with geographic geometries. The geometries allow us to perform spatial operations in addition to the typical tabular data analysis you would do in pandas or Excel.

DataFrame vs. GeoDataFrame.

A GeoDataFrame is a tabular data structure that contains a GeoSeries. The most important property of a GeoDataFrame is that it always has one GeoSeries column that holds a special status. This GeoSeries is referred to as the GeoDataFrame’s "geometry". When a spatial method is applied to a GeoDataFrame (or a spatial attribute like area is accessed), the command always acts on the "geometry" column.

A table with more than one column is either a DataFrame or a GeoDataFrame. If one of the columns is a geometry column, it is a GeoDataFrame; if none of them is, it is a plain DataFrame. Similarly, a single column is either a Series or a GeoSeries: if that column holds geometries, it is a GeoSeries. Let us see an example of each data type, starting with a DataFrame.

We have only two columns here, and neither of them is a geometry column, so this data is a DataFrame and the output of the type function is pandas.core.frame.DataFrame. If the table has a geometry column, it is a GeoDataFrame, as below.

Similarly, a single column that is not a geometry column is a Series, while a single geometry column is a GeoSeries, as shown below.

This will yield pandas.core.series.Series and geopandas.geoseries.GeoSeries respectively.
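The four cases above can be sketched with a tiny in-memory table. This is only an illustration: the two-city table and the variable names are assumptions made for the example, not the article's dataset, and the coordinates are approximate.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Toy stand-in for the cities dataset (illustrative, approximate coordinates)
cities = gpd.GeoDataFrame(
    {"name": ["Stockholm", "Nairobi"], "country": ["Sweden", "Kenya"]},
    geometry=[Point(18.07, 59.33), Point(36.82, -1.29)],
    crs="EPSG:4326",
)

two_columns = cities[["name", "country"]]      # several columns, no geometry
with_geometry = cities[["name", "geometry"]]   # several columns, one is geometry
one_column = cities["name"]                    # single non-geometry column
geometry_only = cities["geometry"]             # single geometry column

for obj in (two_columns, with_geometry, one_column, geometry_only):
    print(type(obj))
```

Selecting columns without the geometry column falls back to plain pandas types, which is why the first and third selections are no longer geographic objects.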

With a GeoDataFrame or GeoSeries you can carry out geographic processing tasks. So far we have seen a few, including .plot(). Another example is getting the centroids of polygons. Let us get each country’s centroid and plot it.
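A minimal sketch of the centroid step, assuming two made-up rectangles in place of the real country polygons (the names and coordinates are hypothetical):

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two hypothetical "countries" drawn as rectangles in plain planar coordinates
countries = gpd.GeoDataFrame(
    {"NAME": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Polygon([(3, 0), (7, 0), (7, 2), (3, 2)]),
    ],
)

# .centroid returns a GeoSeries with one point per polygon
centroids = countries.centroid
print(centroids)

# To draw both layers (requires matplotlib):
# ax = countries.plot(color="lightgrey")
# centroids.plot(ax=ax, color="black")
```

With real data in a geographic CRS such as EPSG:4326, GeoPandas warns that centroids may be inaccurate, so reprojecting first is usually the safer route.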

This is what the plot looks like; each point represents a country’s centroid.

country centroid

Exercise 1.1: Create a union of all polygon geometries (countries). Hint: use .unary_union (named .union_all() in GeoPandas 1.0 and later).

Exercise 1.2: Calculate the area of each country. Hint: use .area

2. Table Join vs. Spatial Join

A table join is a classical query operation: when two separate tables share a column, you can join them on that column. A spatial join, on the other hand, relies on a geographic operation, for example joining each city to its country by location. We will see both examples below.

We can join/merge the two tables based on their shared column NAME. This is a pure pandas operation and does not involve any geographic operations.
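As a rough sketch of such a table join, the two small tables below stand in for the real ones. Only the shared column name NAME comes from the text; the other columns and rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical attribute tables that share the NAME column
countries_table = pd.DataFrame(
    {
        "NAME": ["Sweden", "Kenya", "Chile"],
        "CONTINENT": ["Europe", "Africa", "South America"],
    }
)
codes_table = pd.DataFrame({"NAME": ["Sweden", "Kenya"], "ISO_A2": ["SE", "KE"]})

# Inner join: keep only the rows whose NAME appears in both tables
merged = countries_table.merge(codes_table, on="NAME", how="inner")
print(merged)
```

Chile drops out because it has no match in the second table; how="left" would keep it with a missing ISO_A2 value.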

In a spatial join, however, the merge relies on a geographic operation. As an example, we want to join the following two tables based on their locations: which country contains which city, or which city is within which country. We will use the GeoPandas function .sjoin() to do the spatial join and show a sample of 5 rows.

As you can see from the table below, each city is matched with its corresponding country based on location. We have used op="within", which keeps the city points that fall within a country polygon. Here we could also use "intersects". Alternatively, with the countries as the left table, we could use op="contains" to find out which countries contain the city points. Since GeoPandas 0.10 this parameter is called predicate, and op is deprecated.
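A minimal sketch of a point-in-polygon spatial join, assuming two made-up square "countries" and three made-up "cities". It uses the predicate keyword, so it assumes GeoPandas 0.10 or newer; on older releases the same argument is spelled op.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical polygons and points, not the article's datasets
countries = gpd.GeoDataFrame(
    {"country": ["West", "East"]},
    geometry=[
        Polygon([(0, 0), (5, 0), (5, 5), (0, 5)]),
        Polygon([(5, 0), (10, 0), (10, 5), (5, 5)]),
    ],
    crs="EPSG:4326",
)
cities = gpd.GeoDataFrame(
    {"city": ["Alpha", "Beta", "Gamma"]},
    geometry=[Point(1, 1), Point(7, 2), Point(20, 20)],
    crs="EPSG:4326",
)

# Keep each city together with the attributes of the polygon it falls within
joined = gpd.sjoin(cities, countries, how="inner", predicate="within")
print(joined[["city", "country"]])
```

Gamma lies outside both polygons, so the inner join drops it; how="left" would keep it with missing country attributes.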

spatial joined table

3. Buffer Analysis

Buffer analysis is an important geoprocessing task. It is widely used in many domains to get the area within a given distance of a feature such as a point. In this example, we will first select a city in Sweden and then create a buffer around it. One tricky thing is that you need to know which CRS/projection your data uses to get the output you expect. If your data is not in a projection whose units are meters, the buffer distance will not be in meters. This is a classic error in the world of geodata. I have used this resource to find a CRS for Sweden that uses meters.

SWEREF99 TM: EPSG Projection — Spatial Reference

We use three different buffer distances (100, 200, and 500) on a single point, the city of Stockholm. Then we plot the result to show the concept of buffering.
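The buffering step might look roughly like the sketch below. It assumes an approximate longitude/latitude for Stockholm and reprojects to SWEREF99 TM (EPSG:3006), whose units are meters, before buffering; the variable names are illustrative.

```python
import geopandas as gpd
from shapely.geometry import Point

# Approximate lon/lat of Stockholm in WGS84 (an assumption for this sketch)
stockholm = gpd.GeoSeries([Point(18.0686, 59.3293)], crs="EPSG:4326")

# Reproject to SWEREF99 TM so that buffer distances are in meters
stockholm_m = stockholm.to_crs(epsg=3006)

distances = [100, 200, 500]
buffers = {d: stockholm_m.buffer(d).iloc[0] for d in distances}

for d, polygon in buffers.items():
    # Each buffer is a polygon approximating a circle of radius d meters
    print(d, round(polygon.area))
```

Calling buffer directly on the EPSG:4326 data would interpret the distances as degrees, which is the error described above.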

buffer Stockholm city example

Exercise 3.1: Create a buffer of all cities. Try different projections and different distances.

Overlay

We sometimes need to create new features out of different geometry types like points, lines and polygons. Set operations, or overlays, play an important role here. We will use the same dataset, but instead of reading it from our unzipped folder we can use the built-in dataset reading mechanism in GeoPandas (available in releases before 1.0). This example comes from the GeoPandas documentation.

We can subset data to select only Africa.

Africa

To illustrate the overlay function, consider the following case in which one wishes to identify the "core" portion of each country – defined as areas within 500km of a capital – using a GeoDataFrame of Africa and a GeoDataFrame of capitals.

To select only the portion of countries within 500km of a capital, we set the how option to "intersection", which creates a new set of polygons where these two layers overlap:

Africa core overlay

Changing the "how" option allows for different types of overlay operations. For example, if we were interested in the portions of countries far from capitals (the peripheries), we would compute the difference between the two.
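To make the two overlay modes concrete, here is a small sketch that assumes one made-up rectangular country and one made-up capital in a metric projected CRS (EPSG:3395 is used purely for illustration). It is not the Africa example itself, only the same pattern on toy geometries.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical country: a 2000 km x 1000 km rectangle, coordinates in meters
country = gpd.GeoDataFrame(
    {"name": ["Exampleland"]},
    geometry=[Polygon([(0, 0), (2_000_000, 0), (2_000_000, 1_000_000), (0, 1_000_000)])],
    crs="EPSG:3395",
)

# Hypothetical capital close to the western border
capital = gpd.GeoDataFrame(
    {"capital": ["Sample City"]},
    geometry=[Point(300_000, 500_000)],
    crs="EPSG:3395",
)

# Replace the point with a 500 km buffer polygon around it
capital_buffer = capital.copy()
capital_buffer["geometry"] = capital.buffer(500_000)

# "Core": the part of the country within 500 km of the capital
core = gpd.overlay(country, capital_buffer, how="intersection")

# "Periphery": the part of the country farther than 500 km from the capital
periphery = gpd.overlay(country, capital_buffer, how="difference")

print(core.area.sum() / 1e6, "km2 in the core")
print(periphery.area.sum() / 1e6, "km2 in the periphery")
```

The core and periphery together tile the country, so their areas add up to the country's area, while the part of the buffer that falls outside the border is discarded.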

Conclusion

This tutorial covered some geoprocessing tasks on geographic data using GeoPandas. First, we studied the differences between a DataFrame and a GeoDataFrame, and then explored table joins and spatial joins. We also carried out buffer analysis and overlay analysis. In the next tutorial, we will apply what we have learned in this part and the preceding one to a project.

The code is available in this GitHub repository:

shakasom/GDS

You can also run the Google Colaboratory Jupyter notebooks directly from this link:

shakasom/GDS

