How to Learn Geospatial Data Science in 2023

A step-by-step guide for those wanting to learn geospatial data analysis with Python

Maurício Cordeiro
Towards Data Science


Digital image created with Dall-E 2. Caption: Photo of a scientist holding the Earth.

Introduction

Why is this topic relevant? It’s no secret that data science careers are in high demand these days. And when you add the dimension of geospatial analysis to the mix, the possibilities become even more exciting. Climate change, food production, and transitioning to a carbon-free economy are just a few of the many important issues that require a deep understanding of geospatial data. By combining satellite and drone imagery, vector datasets, and field measurements, we can gain deeper insights and drive meaningful changes.

While many resources are available for learning geospatial data analysis with Python, the field is rapidly evolving. In the past, I managed a GIS team where we relied heavily on software such as ArcGIS or ENVI for mapping water resources and conducting hydrological analyses. However, my journey with geospatial analysis using Python began in 2019 when I started my Ph.D. in the subject.

This week I stumbled upon the book “Python Geospatial Development.” Initially, I thought it would be an excellent opportunity to enhance my skills. Still, after examining the table of contents, I was surprised. Apart from GDAL (which I used briefly at the beginning of my journey, but quickly replaced with Rasterio), I hardly use any Python libraries discussed in Chapter 3 — Python Libraries for Geospatial Development. This disparity can be attributed to the fact that the book was first published in 2010, with only minor updates in 2013 and 2016. In the world of computer science, that’s a long time.

Although GDAL itself must still be installed because other libraries rely on its drivers, its Python bindings are no longer necessary and can be confusing to use (to put it mildly). Additionally, I had never heard of some of the other libraries, such as Mapnik. Furthermore, essential packages such as GeoPandas and Xarray are not even mentioned in the book. Finally, Google Earth Engine (a real game changer in this subject) didn’t even exist at the time of the book's publication, which highlights a common issue with books on practical development: they can quickly become outdated.

Well, I’m not advocating against books. They are still important for building a solid foundation in a particular subject. However, to become a practical Geospatial Analyst, it is crucial to learn how to navigate and make sense of the vast amount of online information, which can be overwhelming for those new to the field.

With this in mind, this article is written for those wanting to learn geospatial data analysis with Python in 2023 and provides a quick guide on the primary skills and topics to focus on, as well as some tips on how to effectively navigate the vast amount of information available online and avoid some traps. Note that this guide is based on my own experience and may not necessarily fit everyone's expectations.

Avoid the traps

Before starting, you have to be aware of some traps. I’ve listed just four traps to watch out for, but I’m sure there are more out there that you may encounter during your learning journey:

1- Overwhelming Information: The world of geospatial data science is vast, and it can be challenging to know where to begin. If you start searching the internet, you will be daunted by the amount of information, articles, and courses available on each subject. I know it's overwhelming. Suddenly, you find yourself jumping from one site to another and spending an entire day without actually learning anything. To avoid this, focus on one topic at a time and avoid distractions. Use your preferred method: Pomodoro, Flowtime, or Deep Work. It doesn't matter, as long as you STAY FOCUSED!

Photo by Usman Yousaf on Unsplash

2- Outdated resources: This problem is related to what I mentioned in the introduction. If you rely on an outdated source, you may be just wasting your time and going nowhere. Or, even worse, you may be going somewhere in the... PAST!

Note: This is not so relevant if you are studying foundations such as statistics or physics, but it can play an important role if you are covering a technological topic.

3- Respect your learning curve: The third problem, which happened to me when I started, was that I tried to tackle advanced topics, such as creating a virtual machine in the cloud to host a map server, without even understanding what Docker or Git were. As a result, I spent a lot of time trying to solve issues that were out of my reach. For example, nowadays, when I want to publish a map, I just publish a COG (Cloud Optimized GeoTIFF) directly on GitHub (e.g., https://cordmaur.github.io/Fastai2-Medium/occurrence_map.html), without the hassle of setting up map servers, etc. But I only learned about COGs after some time studying the subject.

Figure 1: COG file hosted directly on GitHub. Image by author.

4- Don't try to understand everything from the ground up: Here, we must use some reasoning. It’s essential to grasp the main topics and promptly catch up on any important concepts that may be missing. But you should do that only when a specific need arises. For example, suppose you are running a Machine Learning model that uses a derivative in the process. If you are like me, you will be tempted to search for the best Calculus course available, even though you studied it at college for two years in a row. Please, don’t fall into this trap; it’s impossible to understand all underlying subjects deeply. Instead, perform a quick review of the topic to keep you on track.

How to Start

As mentioned earlier, mastering everything that is needed in geospatial data science can be overwhelming and take time. If you try to learn everything in the "correct" sequence, you may become discouraged and give up. In this regard, I agree with Jeremy Howard from fast.ai, who suggests taking a top-down approach. This way, you can see meaningful results even if you don't fully understand everything that lies underneath.

So here is the "fuzzy" order I would follow if I had to start from scratch and learn as quickly as possible.

Digital image created with Dall-E 2. Caption: Photo of a teddy bear climbing a stair with the Earth.

1- Basic Python Programming

That's an essential topic. Without it, it's pointless to learn about vector or raster datasets, for example, because everything will be abstract, and you will not have the tools to explore them. I know you can use QGIS or other GIS software. Still, that is different from loading the raw data into a numpy array or opening a GeoJSON file and exploring it by yourself. Besides that, learning Python programming first can open other doors that you had not imagined beforehand.
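To give an idea of what "exploring the raw data by yourself" can look like, here is a minimal sketch that uses only Python's standard library. The GeoJSON content, the station names, and the `point_bounds` helper are all made up for illustration; a real file would be read from disk instead of from a string.

```python
import json

# A tiny, made-up GeoJSON document (two point features), for illustration only.
GEOJSON_TEXT = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "station_a"},
     "geometry": {"type": "Point", "coordinates": [-47.9, -15.8]}},
    {"type": "Feature",
     "properties": {"name": "station_b"},
     "geometry": {"type": "Point", "coordinates": [-43.2, -22.9]}}
  ]
}
"""


def point_bounds(collection):
    """Return (min_lon, min_lat, max_lon, max_lat) over the Point features."""
    lons, lats = [], []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] == "Point":
            lon, lat = geometry["coordinates"][:2]
            lons.append(lon)
            lats.append(lat)
    return min(lons), min(lats), max(lons), max(lats)


# GeoJSON is plain JSON, so it becomes ordinary dicts and lists.
collection = json.loads(GEOJSON_TEXT)
names = [feature["properties"]["name"] for feature in collection["features"]]
bounds = point_bounds(collection)

print(names)
print(bounds)
```

Once the data is just dictionaries and lists, nothing about it is hidden behind a GIS interface, which is the point of learning the language first.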

Note: Keep in mind that, in the beginning, you will not be writing programs. You will be just relying on third-party libraries to perform your analysis. So, it's not advisable to take a resource designed for programmers at this point. You can read the post “How to Learn Python for Data Science the Right Way” to understand the difference.

So, to learn just the bare minimum, I suggest a course like Codecademy’s ‘Learn Python 3’. It takes about 25 hours, so it can be completed in less than one week, and it provides a quick overview of the language. Additionally, I’ve published four lessons of a course called ‘Introduction Python for Scientists’ on my YouTube channel, which can also be an excellent introduction to the topic.

Digital image created with Dall-E 2. Caption: Pencil drawing of a toddler programming on a vintage computer.

After that, I suggest studying the Python Data Science Handbook, by Jake VanderPlas, available for free on the author's GitHub (https://jakevdp.github.io/PythonDataScienceHandbook/). This book is not focused on teaching the Python language’s inner workings, such as creating complex functions, defining classes, inheritance, or other object-oriented aspects. Instead, it provides a quick overview of the main libraries, such as numpy, pandas, and matplotlib, that are essential tools for data analysis.
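As a taste of what these libraries offer, the sketch below uses numpy and pandas on made-up values (the elevation grid, the site names, and the turbidity readings are invented for illustration). It is not taken from the handbook; it only hints at the style of work the book covers.

```python
import numpy as np
import pandas as pd

# A made-up 3x3 "raster" of elevation values in metres.
elevation = np.array([[10.0, 12.0, 15.0],
                      [11.0, 14.0, 18.0],
                      [13.0, 17.0, 21.0]])

# numpy: statistics and boolean masking over the whole array, no explicit loops.
mean_elevation = elevation.mean()
high_cells = int((elevation > 14.0).sum())  # number of cells above 14 m

# pandas: a made-up table of field measurements, summarised per site.
measurements = pd.DataFrame({
    "site": ["a", "a", "b", "b"],
    "turbidity": [3.0, 5.0, 10.0, 14.0],
})
mean_by_site = measurements.groupby("site")["turbidity"].mean()

print(mean_elevation, high_cells)
print(mean_by_site)
```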

2- GEE and GIS Fundamentals

Now that you have a basic understanding of Python and its main libraries for data analysis, it’s time to get your hands dirty with some GIS concepts. Taking a generic GIS course may result in being bombarded with information about how satellite data is collected, the path of light within the atmosphere, and how it affects sensor readings. While these subjects are important if you are doing research, they may not be necessary for a practical introduction to geospatial data analysis.

Photo by Krzysztof Hepner on Unsplash

From a practical perspective, I suggest learning the fundamentals of GIS via a hands-on course on Google Earth Engine that uses Python. This way, you’ll practice Python and have quick access to geospatial data (both vector and raster) without worrying about downloading or setting up complex environments. You’ll start performing your first spatial analyses right away.

Such courses are available on Udemy: Spatial Data Analysis in Google Earth Engine Python API or The Complete Google Earth Engine Python API & Colab Bootcamp. Another option is taking a complete GEE course that will be using the JavaScript environment, but everything can be replicated using Python, with minor changes in the syntax.

3- Geospatial Python Libraries

Google Earth Engine (GEE) is powerful and provides tons of ready-to-use data, but it also has some shortcomings. Everything must run in the Google cloud. While it provides free access to its resources, it can also incur costs, especially for large-scale processing. So, the ideal solution is to use open Python libraries to work with any data from any source. The main libraries are:

  • rasterio: for reading raster data and loading it into arrays.
  • geopandas: similar to pandas, but with a geometry column for vector data.
  • xarray: similar to numpy, but aware of coordinates and dimensions.
  • shapely: for vector processing (clipping, intersection, etc.).
  • fiona: for reading standard shapefiles.
  • leafmap: for dynamic map visualization.
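To make one of these libraries concrete, here is a minimal shapely sketch of the intersection and containment operations mentioned in the list. The rectangles and the point are made up and live in an arbitrary planar coordinate system; real data would normally come from a file and carry a coordinate reference system.

```python
from shapely.geometry import Point, box

# Two made-up rectangles, given as (minx, miny, maxx, maxy).
study_area = box(0, 0, 2, 2)
tile = box(1, 1, 3, 3)

# Intersection returns a new geometry covering only the shared region.
overlap = study_area.intersection(tile)
overlap_area = overlap.area
overlap_bounds = overlap.bounds

# Containment test for a made-up station location.
station = Point(0.5, 0.5)
station_in_area = study_area.contains(station)
station_in_tile = tile.contains(station)

print(overlap_area, overlap_bounds)
print(station_in_area, station_in_tile)
```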

4- Managing Projects and Python Environments

Suppose you’ve reached this point without previous knowledge of Python programming. In that case, you likely have a messy working environment that you don’t fully understand how you created, along with a collection of disconnected Jupyter notebooks, making it challenging to locate that important code snippet you discovered the week before. If you’re in a similar situation, don’t panic! It happens to every newcomer. Now is a good time to take a step back and organize your working environment. To do this, you need to understand:

  • The importance of package management tools such as Pip and Conda. You can also explore Miniconda and Mamba for faster performance;
  • How to create a kernel for your Jupyter notebooks;
  • How to version-control your work using Git and back it up on GitHub.
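A first step toward untangling a messy setup is simply checking which interpreter a notebook is running and which package versions it sees. The sketch below does that with the standard library only; the `describe_environment` helper and the package names passed to it are illustrative assumptions, and the last name is a deliberately nonexistent placeholder.

```python
import sys
from importlib import metadata


def describe_environment(packages):
    """Report the running interpreter and the installed versions of some packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "executable": sys.executable,        # path of the Python binary in use
        "prefix": sys.prefix,                # root folder of the active environment
        "python": sys.version.split()[0],    # e.g. "3.11.4"
        "packages": versions,
    }


# The last name is a placeholder that should not exist anywhere.
report = describe_environment(["numpy", "geopandas", "package-that-does-not-exist-xyz"])
for key, value in report.items():
    print(key, "->", value)
```

Running this in two different notebooks quickly shows whether they share the same environment or not.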

5- Deep Dive into Python Programming

Up to now, you’ve probably been using Jupyter notebooks for your experiments. They are an excellent data exploration and analysis tool, but don't expect to create a fully functional piece of software with them (unless you are adopting nbdev). Besides that, writing a notebook that performs a specific analysis differs from writing fully operational software that will be deployed to a server, for example. I discussed this issue in a previous article called "Why Data Scientists Should use Jupyter Notebooks with Moderation".

Basically, you will need an IDE (PyCharm or VSCode) to develop your functions, classes, and packages. The Jupyter notebooks will be used to call what you’ve created and display the results. Remember the first item in the list, when I advised against using a Python resource meant for programmers? Well, now it's time to build those strong programming skills.
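As a rough illustration of this split, the sketch below shows the kind of small, reusable class that would live in a `.py` file and be imported from a notebook. The `BoundingBox` class and its methods are hypothetical, written only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned bounding box; all coordinates share one reference system."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def __post_init__(self):
        # Validate on creation so mistakes surface early, not deep in an analysis.
        if self.min_x > self.max_x or self.min_y > self.max_y:
            raise ValueError("minimum coordinates must not exceed maximum coordinates")

    @property
    def width(self):
        return self.max_x - self.min_x

    @property
    def height(self):
        return self.max_y - self.min_y

    def intersects(self, other):
        """Return True if the two boxes share at least one point."""
        return not (
            other.min_x > self.max_x or other.max_x < self.min_x
            or other.min_y > self.max_y or other.max_y < self.min_y
        )


# In a notebook, this part would follow an import of the class from your own module.
scene = BoundingBox(0.0, 0.0, 2.0, 1.0)
basin = BoundingBox(1.0, 0.5, 3.0, 3.0)
print(scene.width, scene.height, scene.intersects(basin))
```

Keeping the class in a module means it can be tested and versioned with Git, while the notebook stays short and focused on results.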

6- Advanced Topics (the sky is the limit)

Moving forward, you’ll soon realize that many tasks in geospatial analysis are easily scalable, depending on the amount of available data and processing power. However, to scale up, you’ll need some knowledge of how to deploy a cloud server with more resources and how to do distributed computing. In these scenarios, it’s important to familiarize yourself with technologies such as:

  • Dask Library: Dask is a Python library for parallel and distributed computing. It can save you when you must process large amounts of data that won't fit in your computer's memory (RAM).
  • Docker: Docker permits you to build, package, and deploy environments and applications as containers. These containers can be deployed to cloud servers for improved efficiency and scalability.

Digital image created with Dall-E 2. Caption: Cowboy riding a rocket to the moon
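To show the idea behind Dask on a toy scale, the sketch below splits a small made-up array into chunks and computes a statistic lazily. The array here easily fits in memory, so it only illustrates the programming model; the chunk size is an arbitrary choice for the example.

```python
import dask.array as da
import numpy as np

# A made-up raster-like array; real use cases would be far larger.
values = np.arange(16, dtype="float64").reshape(4, 4)

# Split the array into 2x2 chunks. Operations on it build a task graph
# instead of being computed immediately.
lazy = da.from_array(values, chunks=(2, 2))
lazy_mean = (lazy * 2).mean()

# The work only happens here, and it is carried out chunk by chunk.
result = float(lazy_mean.compute())
print(lazy.chunks, result)
```

The same code style applies when the array is backed by files too large for memory, which is where the lazy, chunked evaluation pays off.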

7- Math and Statistics for Data Science

Are you sure? Isn't this the foundation of everything, you may be thinking? Well, this topic is closely related to trap #4. If you have a good background in math, you should only pick the topics that you are actually having issues with, as the need arises. For that, Khan Academy (https://en.khanacademy.org) is a great free source, and the lectures are split into short videos of around 10 minutes each.

If you think your problem is only with statistics, you can consider the book "Probability and Statistics for Computer Science" by David Forsyth, published by Springer.

On the other hand, if you don't have a strong foundation in math, you should consider buying a course focused on math for data science. There are plenty of those available on platforms such as Codecademy (Fundamental Math for Data Science) or Udemy (Math for Data science, Data analysis, and Machine Learning). They are not expensive and could give you a good overview.

Conclusion

In conclusion, I believe that geospatial data science is one of the most exciting and important fields of our time. By combining the power of Python with the vast amounts of geospatial data available, we have the potential to drive impactful changes in areas such as climate change, food production, and public health. However, to become a skilled practitioner in this field, it’s important to focus on the right topics and tools. In my experience, this means covering the following steps:

1- Basic Python Programming

2- GEE and GIS Fundamentals

3- Geospatial Python Libraries

4- Managing Projects and Python Environments

5- Deep Dive into Python Programming

6- Advanced Topics (the sky is the limit)

7- Math and Statistics for Data Science

During this journey, you should also be aware of common traps and pitfalls that can slow down your progress. With these skills and resources at your disposal, I am confident that anyone can learn geospatial data science and make a real difference in the world.



Ph.D. Geospatial Data Scientist and water specialist at Brazilian National Water and Sanitation Agency. To get in touch: https://www.linkedin.com/in/cordmaur/