Making Sense of Big Data

Displaying a gridded dataset on a web-based map

A step-by-step guide to displaying large GeoTIFFs, using Holoviews, Bokeh, and Datashader

Steve Attila Kopias
Towards Data Science
23 min read · Apr 17, 2021


(Image by the author, using OSM, Holoviews, SRTM elevation data)

The problem

Your boss/client has just come to you with a new task. They have some geographical information in a GeoTIFF file, and you have to “put it on a map like Google Maps, you know, with the streets and zooming and scrolling around” and then make it available on the web. The visualization should convey the data at first glance as an actual colorful image. Additionally, on mouse hover, the user should also see the actual data value at that location, with the exact coordinates. Also, the GeoTIFF file is pretty large (up to a few hundred or thousand megabytes), so loading all of it into a browser is not an option.

You’re not worried: even though you’ve never tried anything like this, it sounds like a pretty common situation, so the net should be full of out-of-the-box solutions and working examples. Except it isn’t. To the best of my knowledge, there isn’t a single example or tutorial showcasing anything like this. There are various existing approaches, and we will discuss them at the end of this article, but all of them fall short of the specification above.

The solution

I’m going to show you how to use Rioxarray to read the data file, then Holoviews + Bokeh to display the results on an image layer overlaid on a map, and Datashader to handle some automatic resizing magic. The target audience for this article is anybody who has to do this job; I don’t assume any prior knowledge other than some essential experience with Python. Some of you may have an extensive GIS, web development, and data visualization background, while others don’t. If you’re looking for the code only, scroll down to the “Loading and understanding the data” section.

The basics

The task is to put the data in the GeoTIFF file somehow onto a tile-based map. First, let’s see what we know about these things themselves.

1. What is a GeoTIFF file?

A GeoTIFF file (awfully simplified) is like an Excel sheet where every cell (pixel) contains a number, while the names of the rows and columns are coordinates. The values could be anything: a population density map of Munich, a satellite night light map of Malaysia, temperature records in Texas, or, for the sake of this post, an elevation map of Kenya.

Knowing the coordinates means that we know precisely where this data grid goes on a map: not just the corners but also every pixel in it. It’s important to note that the file only contains the raw data; it doesn’t have any color information. So while a classical image file (like a bmp, jpeg, or png) would contain the information for the top-left pixel that it is “red”, a GeoTIFF file will contain a numerical value (like it has an elevation of “123” in meters). It’s up to us to tell our visualization system what color corresponds with each value.

Most image viewer applications can open and show TIFF files using a default color scale, usually assuming the lowest values should be black, the highest values white, and everything in-between various shades of gray. But still, it’s important to remember that TIFFs themselves are not colorful, they only contain the data, and we need to color them by using colormaps.

An example of how image-viewer applications usually display TIFF files with a default grayscale colormap. (Image by the author)

One other important thing to consider is the file’s resolution, meaning the size of the area one pixel covers. There can be two similar GeoTIFF files, both covering the same total area, one with 50x50 meter pixels and the other with 5000x5000 meter pixels. It’s easy to see that the first one needs 100*100=10,000 times more pixels to cover the same area as the second one. Thus it’s going to be way bigger in file size, but it will give us more detail. The second one would be a way smaller file, but it would only have one data point for every 25 square kilometers. In this exercise, we are going to use exactly these two resolutions. We start with a low-resolution prototype, and we switch to the high-resolution dataset as soon as we learn how to use it without freezing everything.

The difference between low and high-resolution data. (Image by the author using OSM, Holoviews, SRTM elevation data)

In reality, there is a lot more magic happening in a GeoTIFF. For the sake of simplicity, we are going to assume our GeoTIFF contains one “band” (like an Excel file containing one sheet). Also, we won’t care about the inner details of its cloud-optimized image pyramids or the exact coordinate reference systems. However, I highly recommend reading up on those just for the fun of it.

2. What is a tilemap?

A tile-based map is something that we are all familiar with from Google Maps. The user has an interactive viewport they can drag around and zoom in and out of, and the map provider’s server sends them appropriately sized map tiles with a reasonable amount of detail for that zoom level. While integrating Google Maps into your application would need a developer key and a paid subscription, there are many free map providers we can use. Thankfully, most charting libraries make them easy to import.

The important thing here is not to confuse tilemaps with the “drawn maps” (also called geo maps, geojsons, vector maps, shapefiles depending on the context and format). The latter are primarily static outlines of countries or other areas and have none of the functionalities mentioned here.

The area of Kenya displayed on a tile-based map and as a vector map. (Image by the author using OSM)

3. Tools used

Rioxarray (with its dependencies xarray and rasterio) is a fairly capable module for handling gridded geographical datasets like a GeoTIFF, so we will use it to open and convert our data, but it is not necessary to know a lot about it.

The Holoviz ecosystem, however (that contains Holoviews, Bokeh, Datashader, and more) is a much more exciting entity. It is an open-source set of tools financed by, among others, Nvidia and Anaconda.

Bokeh is a charting tool: it lets you create interactive web charts in Python and automatically displays them in the browser through its BokehJS library in JavaScript. It’s somewhat similar to Plotly if you’re familiar with that. One of the main differences is that Bokeh itself has a built-in web server, so it can create deeply interactive charts where a user interaction in the browser (like dragging a slider) gets sent to the server, where Bokeh modifies the data and sends back the result seamlessly. (Plotly uses Dash for the same effect.) This is what’s going to happen in this notebook too: every time we create an interactive visualization, a Bokeh web server will spin up in the background and serve it to us without us having to worry about it.

Holoviews sits a few steps higher in the hierarchy. While in Bokeh you have to create all your charts manually (even if there are many sensible defaults and helper functions), Holoviews wants to give you a tool to just describe your data with your desired visualization and then let a plotting module (like Bokeh) do the menial work. The difference is a bit like the one between cooking, where you have to deal with every detail of the process, and going to a restaurant, where you just describe the food you want and let the staff make it without having to deal with the details. Although with Holoviews, you can get access to the kitchen (the plotting module) too if you want to do a few steps precisely the way you need them.

Holoviews also lets you use various plotting libraries. We will let it use Bokeh here as we will need some specific functionality. In many other cases, it’s just a matter of switching one keyword to “plotly” or to “matplotlib”, and you get the exact same chart created in those systems.

The same code displayed in Matplotlib, Bokeh, and Plotly with default settings. (Image by the author.)

Holoviews is great at creating really high-level functionality in just a few lines of code as we are going to see.

Datashader, on the other hand, does just a few things but implements them well. Namely, it creates images any way you want based on the data. Often based on lots of data. Like billions of unique points or gigabytes of gridded data that would be otherwise impossible to display.

I’m not affiliated in any way with the Holoviz ecosystem, but I feel that it is a seriously underrated set of tools that deserves far more attention in the data science community. One possible reason why it doesn’t get it is that for some functionality, the documentation lags behind the actual capabilities, especially when one has to understand what’s happening without prior knowledge. That’s why this tutorial walks through the possible pitfalls and tries to make transparent how everything works, hopefully making these tools more accessible to a wider audience.

Data Sources

The elevation data used throughout this article is SRTM elevation data, which I exported from Google Earth Engine (more on that in the Setup section below); the basemap tiles come from OpenStreetMap.

Setup

We will use Google Colab, a free Jupyter notebook environment that makes it easy to experiment and share the results. You can follow the whole article through the Colab Notebook, or you can use it to experiment after you finish reading this.

It already comes with many modules preinstalled, but we are going to need some more, and we also want to make sure we have the exact versions we need.

We can run command line commands by prepending a line with !, so we will use that to run pip and install the latest versions (as of 2021-04-16) of these modules. We may get some incompatibility errors for underlying dependencies, but they won't cause any problems with this exercise.
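A minimal sketch of what that install cell can look like (the package list is inferred from the versions printed below, with Datashader added because we need it later; pinning the versions is optional):

!pip install holoviews==1.14.3 bokeh==2.3.1 rioxarray==0.3.1 datashader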

Next, we import all the necessary modules with their shorthands. We will use these shorthands a lot, so feel free to come back any time you are not sure where some of the functions we use come from. We also display the version of some modules just to be sure everything is in order.
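A sketch of such an import cell, matching the shorthands used later in this article (hv for Holoviews, hd for its Datashader operations); the original notebook may import more than this:

import bokeh
import holoviews as hv
import holoviews.operation.datashader as hd
import rioxarray

print('Versions: ', {'holoviews': hv.__version__, 'bokeh': bokeh.__version__, 'rioxarray': rioxarray.__version__})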

Versions:  {'holoviews': '1.14.3', 'bokeh': '2.3.1', 'rioxarray': '0.3.1'}

We are also going to need the data itself. I’ve already exported the elevation map of Kenya as a cloud-optimized GeoTIFF file from Earth Engine to my Google Drive. (Let me know in the comments if you would be interested in a tutorial on how to do that with the various datasets available.)

Google makes it possible to download publicly shared files from Drive to Colab with the preinstalled gdown tool.

If you want to import your own dataset into Google Colab, you simply need to share your Google Drive file with “Anyone with the link”. You will get a link like this:

https://drive.google.com/file/d/16dkulaX8RkkuODqualrarIWtG_iBDMSZ/view?usp=sharing

This is not a direct download link. It opens up the file in the Google Drive preview window. To get a link you can use to download the file, you need to copy the ID part of your original link (the long string between /d/ and /view in the example above) and put it into this link:
https://drive.google.com/u/0/uc?id=IDCOMESHERE&export=download

Your end result will be something like this:
https://drive.google.com/u/0/uc?id=16dkulaX8RkkuODqualrarIWtG_iBDMSZ&export=download

This is a direct download link of your file, so we can use this as a source_location with gdown in the following way:

gdown -O local_destination source_location

As the default working directory on Google Colab is /content, we create a directory named kenya first and then download the two example files I've shared into that directory.
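As a sketch (the local file names are my own, and FILE_ID_SMALL / FILE_ID_LARGE stand in for the IDs of the two shared files):

!mkdir kenya
!gdown -O kenya/kenya_small.tif "https://drive.google.com/u/0/uc?id=FILE_ID_SMALL&export=download"
!gdown -O kenya/kenya_large.tif "https://drive.google.com/u/0/uc?id=FILE_ID_LARGE&export=download"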

In the download log, you can see that the smaller file is 60.3KB, the bigger one 329MB. That’s a huge difference, so at first, we start with the smaller one.

Loading and understanding the data

We will use the rioxarray module to load the data found in the file into an xarray data structure.
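A sketch of the loading step (the file name follows the download sketch above; masked=True is an assumption that turns the file's nodata pixels into NaN values):

dataarray = rioxarray.open_rasterio('kenya/kenya_small.tif', masked=True)
dataarray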

Sidenote: If the last line of a Google Colab cell contains just a variable, Colab will display it in a meaningful way. We will use that to see the result of each step from now on.

Without going into the details of this data format, we can see that our data has a dimension of 218 by 179, with only one band: the elevation data itself. We can also see that the x values are somewhere around 30 to 42, while the y values are around -5 to 5, so they look like reasonable values for coordinates.

To get a bit more familiar with our data, we can easily convert it into a pandas DataFrame and display that. We can see that it is indeed a gridded dataset, with the coordinates being the row and column headers and the elevation data in meters being the values. (NaN, or Not a Number, values mark missing data, as I cropped the dataset to the shape of Kenya’s borders.)
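The conversion itself can be a one-liner (a sketch; to_pandas on the first band returns a DataFrame whose row index and column headers are the coordinates):

df = dataarray[0].to_pandas()
df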

It’s not necessary at all, arguably even pointless, but to hammer in the “we are working with colorless gridded data, not an actual image” point, we can even export and download our data as a CSV file and open it in Excel if we want to.

The same dataset displayed in Excel. (Image by the author.)

Configuration

Before going any further, we will set up a few configuration values, starting with our colormap. Usually, we can just give a list of colors or a name of an existing colormap to Holoviews, but I like to use a specific subset of a colormap for the current dataset. I’m not going to explain the details here as you are not going to need this for your own data. It’s enough to know that we take the 25%-100% part of the terrain colormap, and get a list with 192 colors spreading through that range.
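One way to build such a list, as a sketch (slicing Matplotlib's terrain colormap; the variable name colormap is mine):

import numpy as np
from matplotlib import cm
from matplotlib.colors import rgb2hex

# take the 25%-100% slice of the 'terrain' colormap as 192 hex color strings
colormap = [rgb2hex(c) for c in cm.terrain(np.linspace(0.25, 1, 192))]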

The resulting list of colors

As the next step, we set up the default options that all our Holoviews elements will use from now on. See their documentation here and here. We make sure we use our colormap and width/height settings and display the colorbar. We also turn on the hover tooltip and the ability to zoom in and out by scrolling with a mouse wheel or touchpad.

There are only two things that need a bit more explanation.

  1. I’m going to include the hv.extension('bokeh', logo=False) line at the beginning of every cell. This lets Holoviews know that I want it to display the elements with Bokeh (but to not show the Bokeh logo every time in the output). If you are running your script locally (not on Google Colab), you only need this once. Supposedly it is also possible to make the Colab notebook remember it without repeating, but that proved a bit unreliable for me, so I prefer using it this way.
  2. The data contains Not a Number values outside of the borders of Kenya. By default, they would show up as a gray area, but we don’t want that, so with the clipping_colors parameter, I force them to be invisible (the last two hex digits of the #00000000 RGBA code mean 0 opacity).
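Putting all of that together, the defaults cell can look roughly like this sketch (the width/height values and some option names are my assumptions, not necessarily the original notebook's):

hv.extension('bokeh', logo=False)

hv.opts.defaults(
    hv.opts.Image(
        cmap=colormap,                         # the custom terrain colors from above
        colorbar=True,                         # show the colorbar
        width=700, height=500,                 # plot size in pixels (assumed values)
        tools=['hover'],                       # hover tooltip
        active_tools=['wheel_zoom'],           # zoom by scrolling, without clicking the toolbar first
        clipping_colors={'NaN': '#00000000'},  # make NaN pixels fully transparent
    ),
    hv.opts.Tiles(width=700, height=500, active_tools=['wheel_zoom']),
)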

Displaying the data as an image

Now that we have everything ready, let’s load our data into a Dataset. Not a huge change, but Holoviews will need the data in this format later. dataarray[0] refers to the first band in our dataarray. We also let it know about the value ('elevation') and coordinates (['x', 'y']) to use, just as we defined them above.
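In code, this step can look like the following sketch:

dataset = hv.Dataset(dataarray[0], kdims=['x', 'y'], vdims=['elevation'])
dataset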

We will immediately see why this is convenient as now it is really effortless to feed this dataset into a Holoviews Image and make Holoviews display it with all its glory automatically using our default configuration:
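# a sketch, using the dataset from the previous step
image = hv.Image(dataset)
image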

As you can see, we have an excellent display with interaction icons, zoom, colorbar, etc. We can see the coordinates of the image on the axes; the y axis is roughly centered around the 0 mark, since Kenya sits on the Equator. We even have an automatic hover tooltip (created above with the tools=['hover'] configuration) that shows the exact coordinates and elevation values when the mouse is over the image.

While we’re at it, let me demonstrate the benefits of the TIFF only containing raw data without any color information.

Below you can see the exact same Holoviews Image we created with the exact same data as above. I’ve just given them two other colormaps. The first uses the inferno continuous colormap, painting the lowest values black and the higher values going through purple, orange, and yellow. The second one uses the Dark2 colormap, which is meant for categorical visualization: just by using this color scheme, we immediately categorized our data into 8 distinct elevation levels without changing anything in the data itself.
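As a sketch, re-rendering the same element with those colormaps is only an option change (clone=True keeps the original image untouched):

image.opts(cmap='inferno', clone=True) + image.opts(cmap='Dark2', clone=True)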

You can read more about the available colormaps here.

But let’s get back to the task at hand. It’s a good thing that we have our image, but we will need a tilemap too. Fortunately, it’s very easy to get one. You can see all the various options (dark, light, and terrain) on the Holoviews website.

We are going to pick Open Street Map for now.
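In Holoviews that is a single element (a sketch; the variable name osm is mine):

osm = hv.element.tiles.OSM()
osm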

Putting the image on the map

Ok, that’s a map, zoomable and everything, but it’s empty. This is where the “it just works” magic of Holoviews comes into play: using the * operator, we just throw the image on top of our map and hope for the best:
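# a sketch, using the element names from the previous steps
osm * image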

Well, this is awkward… The image is visible, but instead of being on the map, it is on a white background, and the axes show some weird coordinates too…

To understand this phenomenon (that causes the overwhelming majority of the “image on a map is weird” questions on StackOverflow regardless of the tools used), we need to understand just a bit about coordinate systems, oversimplified as usual.

Our image uses a Planar coordinate system: longitude and latitude in degrees, with values ranging roughly from -180 to 180. It is awesome as it lets us locate points on a sphere-like object like Earth. On the other hand, every available tilemap engine uses the Web Mercator coordinate system, which measures everything in meters. Mercator is optimized to display maps on a flat surface, meaning our screens. This can be very confusing because these Mercator maps do show us the Planar coordinates on the axes (as those numbers are the ones everyone is familiar with). But that’s just a display feature; internally, they all use meters. So when you throw them a dataset with Planar coordinates around -4.2151 or 32.14, those values are meant to be understood as -4.2151 degrees and 32.14 degrees, while the tilemap will interpret them as -4.2151 meters and 32.14 meters from the origin. And while 360 degrees are enough to get around the Earth, 360 meters are a way, way smaller distance.

The result is that our image is indeed displayed on the map, right around the 0,0 point, spreading just a few meters in each direction. Because the Holoviews element automatically zoomed the map in to show the image in its full few-hundred-meter glory, the map is so zoomed in that it cannot even display any map tiles at that resolution for the middle of the ocean. You can see this happening if you start zooming out, and you just keep zooming, as seen in this gif:

Reprojecting the data

Now that we understand why this is happening, let’s do something about it. We need to convert our data to the appropriate coordinate system. Converting between Planar and Mercator coordinates is not something we should do by hand. Not only would it be a real pain to figure out how to map a more or less spherical object onto 2D, but in reality, the dataset could use one of many slightly different Planar coordinate systems, and we definitely don’t want to deal with those. (Programming 101: you should never ever try to reimplement functionality dealing with the details of dates, timezones, translations, or coordinate systems.)

We just happen to be in luck, as the rioxarray module we used to read our GeoTIFF has a function that deals with this exact situation. We just need to give it the code of the desired projection (in this case, the code for Mercator is 'EPSG:3857'). Let's start again, this time with the correct projection.
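A sketch of that step (rioxarray exposes the reprojection as .rio.reproject; the variable names are mine):

dataarray = rioxarray.open_rasterio('kenya/kenya_small.tif', masked=True)
dataarray_mercator = dataarray.rio.reproject('EPSG:3857')
dataarray_mercator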

Note: Reprojecting a huge file can be pretty resource-intensive, so in a production environment, I would recommend saving the result in a Parquet file once and loading that with Dask when displaying the data, but that is outside of our scope now.

This time we can already see in the x values (using scientific notation, 3.775e+06 meaning 3.775 * 10^6) that they are between 3,775,000 and 4,666,000. These are definitely way bigger numbers than before, precisely as we would expect now that the unit is meters. Let's just run through the exact same steps as before (without the configuration part, which is still valid and usable):
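# a sketch: the same Dataset -> Image -> overlay steps, now with the reprojected data
dataset = hv.Dataset(dataarray_mercator[0], kdims=['x', 'y'], vdims=['elevation'])
image = hv.Image(dataset)
osm * image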

Hurray, it works! Kinda…

We have the map, we have the image on top of it at the correct position, we can zoom and drag, and we even have a hover tooltip… That is broken now:

Sigh. It makes sense: this time, we gave the system Mercator coordinates to display, so the map cannot show the familiar Planar coordinates in the tooltip either. Instead, we get everything in meters, in scientific notation no less, which is definitely not useful for most of our users.

So we have to convert the Mercator coordinates back to Planar coordinates on the fly when we display the hover tooltip. For this, we need to use some JavaScript. If you are not familiar with it, don’t be afraid, you can use this as a copy & paste solution and forget about it.

Creating a Custom Hover Tooltip

If you recall, I’ve mentioned that Holoviews can use multiple plotting libraries, and in this example, we are using Bokeh. Bokeh is a Python module: you write Python code, and it gets rendered in the browser by the BokehJS JavaScript library. BokehJS fortunately has a way to convert Mercator coordinates back to Planar coordinates. Although until now we did not even use the Python Bokeh module directly, this setup lets us feed some instructions straight to BokehJS on how to display the hover tooltip.
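As a sketch of the idea: the snippet below builds a Bokeh HoverTool whose formatters run a bit of JavaScript for every hovered point. Instead of relying on a particular BokehJS helper, it simply writes out the inverse Web Mercator formulas, so treat it as a stand-in for the original notebook's tool rather than an exact copy; the tooltip labels and the '@image' column name are assumptions.

from bokeh.models import CustomJSHover, HoverTool

# JavaScript run by BokehJS: convert Web Mercator meters back to degrees
code_lon = """
    const R = 6378137;  // Earth radius used by Web Mercator
    return (value / R * 180 / Math.PI).toFixed(4);
"""
code_lat = """
    const R = 6378137;
    return ((2 * Math.atan(Math.exp(value / R)) - Math.PI / 2) * 180 / Math.PI).toFixed(4);
"""

custom_hover = HoverTool(
    tooltips=[
        ('Lon', '$x{custom}'),      # {custom} makes Bokeh call the formatter below
        ('Lat', '$y{custom}'),
        ('Elevation', '@image m'),  # the column name may differ between Holoviews versions
    ],
    formatters={'$x': CustomJSHover(code=code_lon),
                '$y': CustomJSHover(code=code_lat)},
)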

Let’s try it again, this time with the newly created custom tool.
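As a sketch, that means overriding the default hover tool on our image:

osm * image.opts(tools=[custom_hover])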

Perfect!

Phew. That was a ride…

But what about the huge pixels?

There is an elephant in the room: so far, we only used the small dataset, where every pixel covers 5km*5km. Thanks to this, the whole compressed file is 60KB, so we could easily unpack the data on the server and send the entire thing to the browser. The price we pay for that is that our image is pretty pixelated, especially when zoomed in.

This could work for some cases, but our theoretical boss/client has a very detailed map they want to use. Every pixel on that map covers 50m*50m, so the whole dataset is around 22,000 pixels high and 18,000 pixels wide. This would let us display much more detailed information, and that could be crucial if our users need the values for a particular location. The problem is that for the high-resolution data, even the compressed GeoTIFF is over 300MB. If we tried to display the whole dataset in one go, the unpacked data would be measurable in gigabytes, causing enormous download times and then guaranteed memory freezes in the browser.

Doing that seems a bit unnecessary too. If you think about it, when the user sees the whole image at the default zoom level, they would see all the 22k*18k pixels. The resolution of their screen is way lower, so a much lower resolution snapshot would suffice. Higher resolution images would be only meaningful when they zoom in more and more. But when that happens, they would only see a smaller and smaller portion of the image, so it’s useless to send them the not-visible parts too.

This means that we need a solution

  1. that can catch the current viewport and zoom level,
  2. sends an image to the browser that only covers the visible area
  3. and does that in a reasonable resolution.

That sounds like an enterprise-level functionality we would spend the rest of our lives developing. Except that Holoviews integrates with the Datashader module, and they together can do precisely this.

(Quick side note: Datashader is an incredible tool to generate displayable images from undisplayably big datasets. It lets you display billions of standalone coordinates, or as in our case, display otherwise unusable gridded datasets. If you plan on using Holoviews, you definitely should check out Datashader too.)

Datashader integration to dynamically display high-resolution images

Let’s repeat again what we did above, but finally with the large dataset.
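A sketch of those repeated steps for the big file (the file name follows the download sketch above; note that loading and reprojecting a file of this size takes a while and a fair amount of RAM on the server side):

dataarray_large = rioxarray.open_rasterio('kenya/kenya_large.tif', masked=True)
dataarray_large = dataarray_large.rio.reproject('EPSG:3857')
dataset_large = hv.Dataset(dataarray_large[0], kdims=['x', 'y'], vdims=['elevation'])
image_large = hv.Image(dataset_large)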

Note that this time I did not display the resulting Image. The reason is that, depending on the exact dataset, either the Colab kernel would completely crash, or the whole process would just hang until your browser crashes from running out of memory.

Instead of combining this hv.Image() directly with the map, at first, we create a dynamic image based on it. We do this by simply throwing it into a Datashader function that Holoviews makes available, and we imported it as hd.regrid. Then we combine this newly created dynamic image with the map.
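As code, roughly (the variable names are mine):

dynamic_image = hd.regrid(image_large)  # a DynamicMap that re-renders for the current viewport
osm * dynamic_image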

If we zoom and pan, we can see that something awe-inspiring is happening, and we automatically get better-resolution images. But at the same time, the picture gets weirdly recolored, and the colorbar is changing every time. These things did not happen before.

The explanation is that earlier, our little widget used the whole low-resolution dataset, so it knew what the lowest and highest values were. From those, it could create a robust colorscale that covered the entire data spectrum from around zero to about 5000 meters. This meant that the same value always got the same color no matter where we zoomed.

However, the display part of our new dynamic system only receives the visible part of the dataset. Every time the user interacts, the widget has to use the currently visible lowest and highest values to stretch the colorscale between those. If we zoom into a flat area where the highest value is only a few hundred meters, that will be colored now to the color reserved for the highest value (white) with browns, yellows, and greens covering the values below that. Conversely, if we zoom into a mountaintop where even the lowest visible point is above 2000 meters, that will be colored green, yellows, browns, and whites starting above that. That’s why the image gets recolored every time and why the colorbar shows new values. This actually could be a valuable feature for many cases as it makes the small details of whatever we are focused on pop out, really highlighting the minor differences, but it’s really terrible if you want to have a consistent image.

Fixing the Colorbar

To resolve this issue, we will query our dataset and get the actual minimum/maximum values for the whole dataset.
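One way to do that, as a sketch (the float() calls just turn the zero-dimensional xarray results into plain numbers):

min_value = float(dataarray_large.min())
max_value = float(dataarray_large.max())
(min_value, max_value)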

(-20.0, 5035.0)

Next, we update the color range limits to these values. We do not update the default options because this information is strongly related to this data. If we wanted to display some other dataset next to this, it would be unreasonable to use these limits for that too. Instead, we apply them directly to our combined element.
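In Holoviews terms this is the clim option; applied directly to our combined element, it can look like this sketch:

overlay = osm * dynamic_image
overlay.opts(hv.opts.Image(clim=(min_value, max_value)))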

Where to go from here?

Hopefully, your client/boss accepts a Colab notebook as the end result. Otherwise, you are going to need a more professional way to publish your widget. We’re going to explore that topic in a possible future article.

For now, let me just throw a quick example here of how some of our already discovered features could be presented in an interactive widget.

You can check out the live examples in the Colab Notebook:

https://colab.research.google.com/drive/15aG8Abyv6mWRkvk7IsR3J5WLGYOI2tj0?usp=sharing

Possible alternatives

Whatever your reasons are for not wanting or being able to use the solution above, there are a few alternatives I know of, all of them with some caveats.

Geoviews

In the same Holoviz ecosystem, there is a module specifically for dealing with these kinds of tasks: Geoviews. Supposedly the hv.element.tiles.OSM() * hd.regrid(gv.util.load_tiff('your/file.tif')) line itself gives you the same result as our whole code above (although I did not test it), so it looks like a no-brainer to choose this.

However, there are two problems with it. One is the serious lack of documentation for this feature: even their own website only mentions the load_tiff function once (barely), in a release note. The other is that Geoviews is able to do a lot of amazing GIS magic in the background because it relies heavily on a lot of scientific GIS modules. Some of those need manual installation, and that takes a lot of time, effort, and experience to get right. Like, a lot. I’ve done it a few times with varying success, and it can be a real pain. Based on the forums, I’m not even sure it is possible to set it up on Google Colab, for example. If you have a cushy Linux server that you only need to set up once and can use the same environment for a long time, then it could be the best solution for you. If you keep jumping from Colab through local Windows development to whatever Linux server or serverless production environment, just as I do, you’d better not rely on something you’re not sure you will be able to install next time too.

The developers know of this issue. Hopefully, they will be able to refactor and rewrite the module to need less troublesome dependencies, as it is a pretty ingenious tool. But for now, it is what it is.

Dash + Plotly

Dash and Plotly are to the business sector roughly what the Holoviz ecosystem is to the scientific community. Arguably, they have a less steep learning curve and more thorough documentation.

What they don’t have, however, is any way to put a gridded dataset onto a map like we did, at least at the time of writing this post. It’s a shame, as they have charts for various gridded representations (like px.imshow, go.Image, go.Heatmap), and they have good support for tile-based maps, but no way to combine them. This is especially sad as they even have Holoviews and Datashader integrations. So the best thing you can do is follow one of their Plotly+Datashader tutorials, which would allow you to put a visual image (even one created by Datashader) onto a map, but you won’t have access to the data itself, so no hover tooltip for you. If you don’t need to access the data, and you need to integrate your project into a Dash app, it could be a reasonable compromise.

By the way: if at any point Plotly becomes able to display gridded data on a map, then thanks to the Holoviews-Plotly integration, our whole exercise above needs only one change: hv.extension('plotly', logo=False), and everything should show up in Plotly, ready to be displayed in Dash. Fingers crossed.

Dash-Leaflet + Terracotta (or any other way to create and show tiles)

Another thing you could do is convert your GeoTIFF to a tileset similar to the ones these maps already use and put that tileset on top of the base map. The Mapbox tile service is one way to do that, but there is also a Python module called Terracotta that takes your GeoTIFF, creates map tiles, and serves them on demand. From there, you just need a map that can use them.

There are many options, but Dash-Leaflet is one of the best. It brings the popular LeafletJS mapping library to Dash, and its author even has a Terracotta + Dash-Leaflet demo showcasing exactly this, so it is quite easy to get up and running. It has the same downside as the Plotly+Dash solution: since the browser only receives visual images from the server, there is no way for it to show the data in a hover tooltip. However, it has a smart workaround where, on mouse click, it can send a request back to the server and get the exact data at that location. It is not as seamless as a tooltip, but if you only need to see the data on rare occasions, this can be a good solution. Also, Emil, the creator of the Dash-Leaflet library, provides some enterprise-level support on the Dash forums if you have any questions.

But as I mentioned, there are multiple ways to create and serve tilesets, and if you have that, you can use practically any charting library that can display tile-based maps, so you do you.

Do you know of any other methods? Let me know in the comments!

Acknowledgment

Thanks to Philipp Rudiger, Bryan Van de Ven, James A. Bednar, and Marc Skov Madsen not just for their contributions to the Holoviz ecosystem, but for answering my relentless questions.

Also thanks to @blois at Google for immediately looking into some incompatibility issues between Colab and Holoviews when it came to interactive functionalities.

