Current Vector Data Standard Formats
If you work with vector data and spatial analysis, you know shapefile and GeoJSON. These are two of the most commonly used formats for storing vector data and running spatial analysis. GeoJSON, in particular, has been the go-to format for serving data to web applications. But both formats have significant disadvantages when you want to scale your work and build integrated & automated workflows for large-scale deployments. The GeoPackage format offers a variety of features in this regard, and that’s why you should consider using GeoPackage files instead of shapefile or GeoJSON. Let’s dive deeper into the details.
If you would like to read more about geospatial data and how it’s changing the field of data analytics, check out my blog on the topic here.
Why Geospatial Data is the Way Forward in Data Analytics? – Samashti | Blog
Problems with Shapefiles
Shapefiles have been around for a long time. ESRI developed the format in the early 1990s, and it has since become one of the most widely adopted standards for working with and sharing vector data. A shapefile stores non-topological vector data along with related attribute data. Though widely used, it has several significant disadvantages for modern use cases:
- Shapefile is a multi-file format. Each vector layer you save has a minimum of 3 files (.shp, .shx & .dbf) and often several other attached files with different extensions. So, if you want to share a shapefile with someone, you have to share all those files for one layer. And if you have several layers, the number of files grows quickly. It’s not ideal to have ~4–6x the number of files for each of your projects.
- Shapefile stores attribute data much like a tabular dataset with column headers. But a column header can be at most ten characters long, and abbreviated headers are not ideal when a column needs a descriptive or identifying name.
- Shapefiles have a maximum size of 2GB. You can’t export a vector layer as a shapefile if its features would exceed that limit.
- A shapefile can’t hold more than one geometry type.
- As a shapefile grows and you deal with more attribute columns & rows, its performance drops drastically, and it becomes sluggish in QGIS even with a spatial index.
Problems with GeoJSONs
GeoJSON was created in part to address the multiple-files problem of shapefiles. Built as an extension of the JSON objects used in web applications, it did solve a few of the issues shapefiles posed. But it has its own set of limitations:
- For a similar number of vector features with attributes, a GeoJSON file is almost double the size of the equivalent shapefile in most cases.
- GeoJSON has no spatial indexing, so it’s tough to handle a large number of features. Even panning around the features on a QGIS map canvas is tiresome most of the time.
- Whenever you load a file to run some task, the entire file is loaded into memory at once, which can create problems in several scenarios, especially with large files.
- Loading is also usually slower than with shapefiles and geopackages, while memory consumption is similar or higher.
- If the file size exceeds some limit (~10–11 GB in my experience), the features might get written incompletely, leaving the file corrupt.
What’s GeoPackage?
The OGC developed GeoPackage as an open format for geospatial information and defines it as follows:
GeoPackage is an open, standards-based, platform-independent, portable, self-describing, compact format for transferring geospatial information.
A geopackage is, in essence, an SQLite container with OGC encoding standards for storing vector features, tile matrices (raster data), non-spatial attribute data and extensions.
By default, each geopackage file has a few meta tables like the ones below, which describe the geospatial layers it holds and how to handle them.
'gpkg_spatial_ref_sys',
'gpkg_contents',
'gpkg_ogr_contents',
'gpkg_geometry_columns',
'gpkg_tile_matrix_set',
'gpkg_tile_matrix',
'gpkg_extensions',
'sqlite_sequence',
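Because a geopackage is an ordinary SQLite file, these tables can be inspected with Python’s built-in sqlite3 module. Here is a minimal sketch, assuming you already have an open connection to the file. The helper names `list_tables` and `list_layers` are made up for illustration and are not part of any library.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite/GeoPackage connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def list_layers(conn):
    """Return (table_name, data_type) pairs registered in gpkg_contents."""
    return conn.execute(
        "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
    ).fetchall()
```

Calling `list_layers(sqlite3.connect("./outputs/road_network.gpkg"))` on a real file should return one row per layer, where `data_type` is `features` for vector layers, `tiles` for rasters and `attributes` for non-spatial tables.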
Advantages
- Open standard, built on the SQLite database
- Very lightweight but highly compatible across environments (especially on mobile devices, where connectivity & bandwidth are limited)
- Geopackages are generally ~1.1–1.3x smaller in file size than shapefiles and almost 2x smaller than GeoJSONs.
$ du -sh road_network/*
193M road_network/roads.geojson
70M road_network/roads.gpkg
81M road_network/roads.shp
- Since vector layers in a geopackage come with an R-tree spatial index by default, loading the file in QGIS or querying the file database is fast.
- There is no practical limit on the file size, and it can hold a large number of features in a smaller file.
- Unlike shapefiles, column headers can be full names, so each column can carry the right context.
- Algorithms run and return outputs faster on geopackages than on shapefiles (you can try this in QGIS).
- A single geopackage file can hold multiple vector layers, each with a different geometry type.
$ ogrinfo ./outputs/road_network.gpkg
INFO: Open of './outputs/road_network.gpkg' using driver 'GPKG' successful.
1: roads_area_hex8_grids (Multi Polygon)
2: roads_area_hex9_grids (Multi Polygon)
3: roads_area_major_segments (Multi Polygon)
4: roads_network_lines (Line String)
5: roads_poly_line_vertices (Point)
6: roads_intersection_node_points (Point)
7: roads_end_node_points (Point)
- You can also store non-spatial attribute tables (plain tables, much like pandas DataFrames) alongside vector layers.
$ ogrinfo ./india_villages_master_2011.gpkg
INFO: Open of './india_villages_master_2011.gpkg' using driver 'GPKG' successful.
1: village_data_master (None) # (non-spatial)
2: village_codes_mapping (None) # (non-spatial)
3: village_points (Point)
4: village_voronoi_polygons (Multi Polygon)
- We regularly make changes to vector layers as the data is updated, and loading and editing the features of geopackage files in QGIS or Python is faster.
- The file can be handled using GDAL, QGIS, Python, R, SQLite and Postgres (with a few limitations in each).
- Loading a geopackage into a Postgres database is much faster than loading a shapefile or GeoJSON (which takes forever with some larger datasets), since it’s already a spatially indexed database format.
- Interestingly, geopackages can also store rasters as a tile matrix (with some limitations, of course).
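To make the spatial indexing point concrete, here is a rough sketch of a bounding-box query that goes through the R-tree table with plain sqlite3. It assumes the index table follows the `rtree_<table>_<geometry column>` naming used by the GeoPackage R-tree extension and that the layer’s primary key is `fid` (the GDAL default). The function name `features_in_bbox` is hypothetical. Note that this only compares bounding boxes; it is not an exact geometry intersection test.

```python
import sqlite3


def features_in_bbox(conn, table, geom_col, bbox, pk="fid"):
    """Return rows of `table` whose bounding box intersects `bbox`.

    bbox is (minx, miny, maxx, maxy) in the layer's coordinate system.
    Table and column names are interpolated into the SQL, so they must
    come from a trusted source (assumption: you control these names).
    """
    minx, miny, maxx, maxy = bbox
    rtree = f"rtree_{table}_{geom_col}"
    sql = (
        f'SELECT t.* FROM "{table}" AS t '
        f'JOIN "{rtree}" AS r ON t."{pk}" = r.id '
        "WHERE r.minx <= ? AND r.maxx >= ? AND r.miny <= ? AND r.maxy >= ? "
        f'ORDER BY t."{pk}"'
    )
    # Two boxes intersect when each one starts before the other one ends.
    return conn.execute(sql, (maxx, minx, maxy, miny)).fetchall()
```

The geometry column in the returned rows is still a GeoPackage binary blob, so it needs decoding before use; the point of the sketch is that only the matching rows are read.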
How can one use it in their Workflow?
We have seen the advantages of geopackage files over shapefiles and GeoJSONs. But how, and to what extent, can we integrate geopackage files into our spatial analysis workflows? Let’s explore a few options.
- Large Output Files
- Tiled Tables / Multi-layer Files
- Reduce/Avoid Redundant Files for Outputs
- Spatial Views
- Load only Parts of Vector Layer onto Memory
- Handling Geography Masters
- Work In Progress (WIP) Layers in One File
- File imports for CartoDB
- Samples, Default Colour Styles & other Attributes
All these points are explained in detail on my blog. Head over there to read more on how geopackage can make your spatial analysis workflow much faster.
GPAL
GPAL (Geopackage Python Access Library) is something I built to address a few limitations of using geopandas to read and handle geopackage files in Python. A few of the geopackage features I mentioned above don’t have methods in Python, so I started building a module.
At the moment, the module can read geopackage files and handle file operations on them with SQL queries, much like the sqlite3 module does for an SQLite database. This helps in loading only parts of the vector data into memory instead of the whole layer. I’m currently working on a few other features as well.
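The idea of loading only part of a layer can be illustrated with the standard sqlite3 module alone. The sketch below is not GPAL’s API; `iter_batches` is a hypothetical helper that runs any SQL query against the geopackage and yields the result in batches, so only one batch sits in memory at a time.

```python
import sqlite3


def iter_batches(conn, sql, params=(), batch_size=1000):
    """Yield lists of rows so only one batch is held in memory at a time."""
    cursor = conn.execute(sql, params)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows
```

A `WHERE` clause in the query narrows the read further, for example selecting only the villages of one state from a country-wide layer.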
What’s Planned?
- Since geopackage handles both spatial and non-spatial tables, the Python module needs a method to process both kinds of tables consistently.
- Table views are a database feature, and since geopackage is based on SQLite, we can extend them to spatial views, as I mentioned above. But that involves updating the gpkg meta tables inside the geopackage file, so it needs a method of its own.
- Methods to handle the multi-layer format files in different workflows.
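As an illustration of what handling the meta tables for a spatial view involves, here is a minimal sketch with sqlite3. It is not GPAL code; `register_spatial_view` is a hypothetical name. It creates the view and then registers it in `gpkg_contents` and `gpkg_geometry_columns`, filling only the core columns. Readers such as GDAL generally expect the view’s first column to be a unique integer id, so the `SELECT` should start with the layer’s key.

```python
import sqlite3


def register_spatial_view(conn, view, select_sql, geom_col, geom_type, srs_id):
    """Create a view and register it in the GeoPackage metadata tables.

    `view` and `select_sql` are interpolated into SQL, so they must come
    from a trusted source (assumption for this sketch).
    """
    with conn:  # commit on success, roll back on error
        conn.execute(f'CREATE VIEW "{view}" AS {select_sql}')
        # Tell GeoPackage readers that the view is a vector (features) layer.
        conn.execute(
            "INSERT INTO gpkg_contents (table_name, data_type, identifier, srs_id) "
            "VALUES (?, 'features', ?, ?)",
            (view, view, srs_id),
        )
        # Describe which column holds the geometry and what type it is.
        conn.execute(
            "INSERT INTO gpkg_geometry_columns "
            "(table_name, column_name, geometry_type_name, srs_id, z, m) "
            "VALUES (?, ?, ?, ?, 0, 0)",
            (view, geom_col, geom_type, srs_id),
        )
```

For example, `register_spatial_view(conn, "major_roads", "SELECT fid, geom FROM roads WHERE kind = 'major'", "geom", "LINESTRING", 4326)` would expose a filtered layer without duplicating any data; the table, column and SRS values here are placeholders.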
Geopackage is a very light and highly compatible vector file format. You can save a good amount of storage, deal with fewer files, and run algorithms and visualize data faster. Being an open format with continuous updates is the icing on the cake. All current & potential features of the format inspired me to take up the GPAL project, which I hope will add more versatility to using geopackage in Python.
I hope this article helped you understand the features and advantages of geopackage over other vector file formats. I’d be happy to receive any suggestions to improve this further.
If you liked this post, subscribe to the blog to get notified about future posts. You can find me on LinkedIn or Twitter for any queries or discussions. Check out my previous blog on how to use QGIS spatial algorithms with python scripts.
How to use QGIS spatial algorithms with python scripts? – Samashti | Blog
Some Conversions
Using GDAL from the command line
- convert a shapefile to geopackage
$ ogr2ogr -f GPKG filename.gpkg abc.shp
- add all the files (shapefile/geopackage) in a directory to one geopackage
$ ogr2ogr -f GPKG filename.gpkg ./path/to/dir
- add a geopackage layer to a postgres database
$ ogr2ogr -f PostgreSQL PG:"host=localhost user=user dbname=testdb password=pwd" filename.gpkg layer_name
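If you prefer to drive these conversions from a Python script, one option is to call ogr2ogr through subprocess. The sketch below assumes ogr2ogr is installed and on your PATH; the function names are hypothetical. It uses the `-append` and `-nln` options of ogr2ogr so that each shapefile in a directory becomes its own named layer in a single geopackage.

```python
import subprocess
from pathlib import Path


def build_ogr2ogr_cmd(src, dst_gpkg, layer_name=None, append=False):
    """Build the ogr2ogr argument list for writing `src` into a geopackage."""
    cmd = ["ogr2ogr", "-f", "GPKG"]
    if append:
        cmd.append("-append")  # add to an existing geopackage
    if layer_name:
        cmd += ["-nln", layer_name]  # name of the layer inside the geopackage
    cmd += [str(dst_gpkg), str(src)]  # destination first, then source
    return cmd


def shapefiles_to_gpkg(src_dir, dst_gpkg):
    """Write every .shp in `src_dir` as its own layer of one geopackage."""
    for i, shp in enumerate(sorted(Path(src_dir).glob("*.shp"))):
        # The first file creates the geopackage; later files are appended.
        cmd = build_ogr2ogr_cmd(shp, dst_gpkg, layer_name=shp.stem, append=i > 0)
        subprocess.run(cmd, check=True)
```

Keeping the command construction in its own function makes it easy to print and check the exact command before running it on a large directory.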