Case Study / Project Walkthrough

This is the third article in a three-part series on getting started with Geographic Data Science in Python. In this series you learn to read, manipulate and analyse geographic data in Python. This third part wraps up with a relevant, real-world project to cement your learning.
Articles in this series can be accessed here:
Master Geographic Data Science with Real World projects & Exercises
Getting started with Geographic Data Science in Python – Part 2
The learning objectives for this case study are:
- Apply spatial operations to a real-world dataset.
- Perform spatial joins and munge geographic data.
- Carry out Exploratory Spatial Data Analysis (ESDA).
Project setup
In this project, we will use two datasets from Statistics Sweden: a population dataset disaggregated by age and a preschools dataset. Since we are dealing with preschools, we will focus on children aged 0 to 5. According to recent statistics from Eurostat, Sweden ranks as the third-busiest baby-maker in Europe. In this project, we will analyse the geographic distribution of the population under the age of 5 as well as the distribution of preschools.
The population dataset comes in a finely disaggregated format (voting areas), where each area has 700 to 2700 inhabitants. The preschools dataset, on the other hand, consists of coordinate points with the addresses of preschools across the whole country. After a brief exploratory data analysis of both datasets, we will carry out preprocessing and spatial join geoprocessing tasks to combine them. Here is the road map for counting the number of preschools within each area:
- Contains spatial operation: use a spatial join to determine which preschools fall within each polygon or, put another way, which population areas contain the preschool points.
- Group by area and take the size of each group to get the number of preschools in each area.
- Merge the grouped dataframe with the population dataset.
Finally, we will perform Exploratory Spatial Data Analysis (ESDA) to look into the spatial distribution of preschools in different areas.
Let us first read both datasets into GeoPandas. To do so, we first need to download the datasets from the Dropbox link and unzip them.
Looking at the first few rows of the population dataset, note that its geometry is Polygon. Each area has a unique code (Deso) along with a number of other attributes. Age0_5 holds the number of children aged 0 to 5 in each area. The total population is stored in the Tot_pop column.

The preschools dataset, on the other hand, has Point geometry that stores each school's coordinates. The name of the school as well as its address, city and postal code are available as columns in this dataset.

In the next section, we will have a closer look at some aspects of both datasets.
Exploratory Data Analysis (EDA)
Before we embark on geoprocessing tasks, let us analyze the datasets to summarize their main characteristics. Exploring datasets before preprocessing and modelling helps you understand them better. We start with the population dataset, which has 5985 rows and 32 columns in total. Let us compute summary descriptive statistics with the .describe() method. Here is the output for the population dataset.

The following plot shows the distribution of the Age0_5 column in the population dataset.
As you can see from the distribution plot below, it is right-skewed, with a long tail on the right.
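The summary statistics and the right skew can also be checked numerically rather than by eye. The sketch below uses a handful of made-up Age0_5 values (not the real data) purely to show the pandas calls; a positive skewness and a mean above the median are what a long right tail looks like in numbers.

```python
import pandas as pd

# Hypothetical child counts for nine areas (illustrative values, not the real data)
age0_5 = pd.Series([12, 18, 20, 25, 30, 35, 40, 60, 150], name="Age0_5")

# count, mean, std, min, quartiles and max in one call
summary = age0_5.describe()

# Sample skewness: positive values indicate a longer tail on the right
skewness = age0_5.skew()

print(summary)
print(f"skewness: {skewness:.2f}")
```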

Since we are dealing with geographic data, we can also visualize maps. In this example, we will construct a choropleth map of the children aged 0 to 5. First, we need to calculate the children density of each area by computing Age0_5 / Tot_pop and storing the result in a new Age05_density column.
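As a small sketch of that calculation, assuming the column names mentioned in the text and using made-up rows in place of the real dataset:

```python
import pandas as pd

# Hypothetical rows using the column names from the article (values are made up)
population = pd.DataFrame(
    {
        "Deso": ["A", "B", "C"],
        "Age0_5": [50, 120, 0],
        "Tot_pop": [1000, 1500, 800],
    }
)

# Share of each area's population that is aged 0 to 5
population["Age05_density"] = population["Age0_5"] / population["Tot_pop"]

print(population)
```

On the real GeoDataFrame, the new column can then be passed to GeoPandas' plot method, for example `population.plot(column="Age05_density", legend=True)`, to draw the choropleth.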
The output map shows the children density of each area by colour.

The map clearly shows the distribution of children in these areas. In the north, children density is very low, while the east, west and south have a high density of children.
Finally, let us overlay the preschools on the choropleth map and look at the distribution of schools.
The output map shows the children density of each area by colour, with the preschools dataset overlaid on top.

It is quite messy. We cannot tell how many preschools are located in each area. This is where spatial joins and geoprocessing tasks come in handy. In the next section, we will perform several preprocessing tasks to get the number of preschools within each area.
Spatial Join
We want to count the number of preschools within each area. The process consists of the following steps:
- Use a spatial join to determine which preschools fall within each polygon area.
- Group by area and take the size of each group to get the number of preschools in each area.
- Merge the grouped dataframe with the population dataset.
Here is the code to perform the spatial join and create a new GeoDataFrame with the number of schools counted in each area.
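A minimal sketch of these three steps on toy data follows. The square polygons, the three preschool points and the use of Deso as the join key are assumptions made for illustration, and the `predicate` argument assumes GeoPandas 0.10 or later (older releases call it `op`).

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy stand-ins for the real layers: three unit squares side by side
population = gpd.GeoDataFrame(
    {"Deso": ["A", "B", "C"], "Tot_pop": [1000, 2000, 1500]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
        Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
    ],
)
# Two preschools inside area A, one inside area B, none in area C
preschools = gpd.GeoDataFrame(
    {"Name": ["p1", "p2", "p3"]},
    geometry=[Point(0.5, 0.5), Point(0.2, 0.8), Point(1.5, 0.5)],
)

# 1. Spatial join: attach to each preschool the code of the area it lies within
joined = gpd.sjoin(
    preschools, population[["Deso", "geometry"]], how="inner", predicate="within"
)

# 2. Group by area code and count the rows in each group
counts = joined.groupby("Deso").size().reset_index(name="preschool_count")

# 3. Merge the counts back; areas without any preschool get a count of 0
merged_population = population.merge(counts, on="Deso", how="left")
merged_population["preschool_count"] = (
    merged_population["preschool_count"].fillna(0).astype(int)
)

print(merged_population[["Deso", "Tot_pop", "preschool_count"]])
```

The left merge keeps every population area, which matters here because an inner merge would silently drop areas that contain no preschool.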
So if we now look at the first few rows of the merged_population dataset, we will notice an extra column, preschool_count, which stores the number of preschools in each area.

Now we can examine two choropleth maps side by side: one of the age density and a new one based on the preschool count.

However, we cannot meaningfully compare these two completely different features. As you can see, the preschool counts are in the range of 0 to 10 while the age density varies between 0 and 24. In classical statistics, as in the EDA we performed earlier, we assume that the observations in a dataset are independent; geographic datasets, however, exhibit strong spatial dependence. Think of the first law of geography by Waldo Tobler:
"everything is related to everything else, but near things are more related than distant things."
Therefore, in the next section, we dig a little bit deeper and perform an Exploratory Spatial Data Analysis.
Exploratory Spatial Data Analysis (ESDA)
Spatial statistics and Exploratory Spatial Data Analysis (ESDA) are very broad fields. In this section, we will only look at Local Indicators of Spatial Association (LISA) to detect spatial autocorrelation in this dataset and explore how the values of nearby locations relate to each other. This helps us understand the spatial distribution and structure of the data.
To get a closer look at these spatial statistics, we first read a counties dataset, subset our data to the two most populous cities in Sweden, Stockholm and Gothenburg, and then merge it with our preprocessed population dataset.
We will use the PySAL and splot libraries to analyse spatial autocorrelation. The first step is the setup and transformation of the weights. We take preschool_count as our y variable and build Queen contiguity weights, which we then transform. We then use the weights to create a spatial lag column (y_lag), which holds for each polygon the weighted sum of its neighbours' values (an average, once the weights are row-standardised), so that each polygon can be compared with its surroundings.
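To make the weights and the spatial lag concrete, here is a small sketch that builds row-standardised Queen contiguity weights by hand for a hypothetical 3×3 grid of areas, using only NumPy. In the project itself PySAL derives the weights from the polygon geometries; the grid, the counts and the helper name `queen_weights_grid` below are invented for illustration.

```python
import numpy as np

def queen_weights_grid(n_rows, n_cols):
    """Row-standardised Queen contiguity matrix for a regular grid.

    Two cells are Queen neighbours if they share an edge or a corner.
    """
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if (dr, dc) != (0, 0) and inside:
                        w[i, rr * n_cols + cc] = 1.0
    # Row standardisation: every row sums to 1
    return w / w.sum(axis=1, keepdims=True)

# Made-up preschool counts for the nine cells, listed row by row
y = np.array([5, 4, 0, 4, 3, 1, 0, 1, 0], dtype=float)

w = queen_weights_grid(3, 3)

# Spatial lag: for each cell, the average count of its Queen neighbours
y_lag = w @ y

print(y_lag.reshape(3, 3).round(2))
```

With row-standardised weights the lag of the centre cell is simply the mean of the eight cells around it, which is why high values surrounded by high values end up with a high lag.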
Now we can calculate local Moran's I to detect clusters and outliers.
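As a worked example of what the statistic measures, the sketch below computes local Moran's I by hand in its textbook form, I_i = (z_i / m2) · Σ_j w_ij z_j, for four hypothetical areas along a line. The weights and counts are made up, and no significance test is included; PySAL's esda package provides the statistic together with permutation-based p-values, and its scaling constant may differ slightly from the form used here.

```python
import numpy as np

# Four hypothetical areas along a line; each is a neighbour of the next one.
# The rows are already row-standardised (each row sums to 1).
w = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
y = np.array([6.0, 4.0, 1.0, 1.0])  # made-up preschool counts

z = y - y.mean()               # deviations from the mean
m2 = (z ** 2).sum() / len(y)   # second moment of the deviations
z_lag = w @ z                  # average deviation of each area's neighbours

# Local Moran's I: positive where an area resembles its neighbours
local_i = z * z_lag / m2

# With row-standardised weights and this scaling, the mean of the
# local values equals the global Moran's I
global_i = local_i.mean()

print(local_i.round(3), round(float(global_i), 3))
```

Here all four local values are positive: the two high areas sit next to each other and so do the two low ones, which is the pattern positive spatial autocorrelation describes.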
Let us plot the Moran’s I for both cities.
The output of the above code is two scatter plots, as shown below. Stockholm has negative spatial autocorrelation while Gothenburg has positive spatial autocorrelation.

The plots also divide the points into four quadrants, which correspond to four types of local spatial autocorrelation: High-High (HH), Low-Low (LL), High-Low (HL) and Low-High (LH). The upper right quadrant displays HH, the lower left quadrant LL, the upper left quadrant LH and the lower right quadrant HL. This is easier to see on a map, so let us put it on a map.
The following two maps clearly show different clusters in these areas. In Stockholm, HH spatial clusters are not concentrated in one place, while in Gothenburg a large number of adjacent areas have HH values.
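Before mapping, each area needs a quadrant label. A small sketch of the labelling rule, using a hypothetical helper `moran_quadrant` and made-up standardised values, could look like this; treating a value of exactly zero as "low" is an arbitrary choice made for the illustration.

```python
def moran_quadrant(z_value, lag_value):
    """Label a point of the Moran scatter plot by quadrant.

    z_value is the area's own deviation from the mean (x axis) and
    lag_value is the spatial lag of that deviation (y axis).
    """
    own_high = z_value > 0
    neighbours_high = lag_value > 0
    if own_high and neighbours_high:
        return "HH"  # upper right: high value, high neighbours
    if not own_high and neighbours_high:
        return "LH"  # upper left: low value, high neighbours
    if not own_high and not neighbours_high:
        return "LL"  # lower left: low value, low neighbours
    return "HL"      # lower right: high value, low neighbours

# Made-up deviations and their spatial lags for four areas
z = [1.5, -0.5, -1.2, 0.8]
z_lag = [0.9, 0.7, -0.4, -0.6]

quadrants = [moran_quadrant(a, b) for a, b in zip(z, z_lag)]
print(quadrants)
```

HH and LL areas are the clusters, while LH and HL areas are spatial outliers that differ from their surroundings.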

This portrays a much clearer picture than the choropleth map we started with and gives a clear indication of where the different spatial clusters are located.
Conclusion
This project demonstrated how to perform geoprocessing tasks on geographic data using Python. We first carried out Exploratory Data Analysis (EDA) and then performed a spatial join to create a new dataset. In the final section, we covered Exploratory Spatial Data Analysis (ESDA) to gain a deeper understanding of the geographic datasets and their spatial distributions.
The code is available in this GitHub repository:
You can also run the Google Colaboratory Jupyter notebooks directly from this link: