
When I use tools like Google Maps or Uber, I often overlook the complexity of the problem they solve. These apps perform advanced calculations while showing you a polished map built on data that is usually highly accurate. To say the least, it’s impressive how far we’ve come with geographic data and mapping tools!
In this article, I’d like to walk you through some of the most popular data formats and coordinate systems that data scientists and engineers use for geographic data. If you’ve ever looked at map projections of the world, you’ll know that there are many ways to represent spatial geography. This guide does not cover ALL of the tools, but it should provide a good overview of some of the more popular ones.
Geospatial Data Formats
GeoJSON
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}
If you’ve worked with JSON data then GeoJSON data should look pretty familiar! GeoJSON data is easy to process and also simple for the user to understand. It represents geographical features with coordinates and supports geometry types like Point, LineString, and Polygon. GeoJSON is one of the more popular data formats for spatial data and easily integrates with Python, R and tools like Tableau for easy analysis.
There are also great tools available for GeoJSON data, like https://geojson.io/ which will let you test out GeoJSON data to learn more about it.
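To make the structure concrete, here is a minimal Python sketch that reads the Feature shown earlier using only the standard `json` module. The helper name `describe_feature` is my own, and the sketch assumes a Point geometry. The one detail worth remembering is that GeoJSON positions are ordered longitude first, then latitude.

```python
import json

# The Feature from the example above, as a JSON string.
raw = """
{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [125.6, 10.1]},
  "properties": {"name": "Dinagat Islands"}
}
"""


def describe_feature(feature):
    """Return (name, geometry type, longitude, latitude) for a Point feature.

    Hypothetical helper for illustration; it only handles Point geometries.
    """
    geometry = feature["geometry"]
    if geometry["type"] != "Point":
        raise ValueError("this sketch only handles Point geometries")
    # GeoJSON positions are ordered [longitude, latitude], not [lat, lon].
    lon, lat = geometry["coordinates"][:2]
    name = feature["properties"].get("name")
    return name, geometry["type"], lon, lat


feature = json.loads(raw)
print(describe_feature(feature))
```

For real projects you would probably reach for a dedicated library, but plain `json` is enough to inspect a file and check what it contains.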
Shapefile

Shapefiles are another of the most common formats for spatial data, and they can store both geometric (shape) and attribute data. A shapefile is really just a collection of files that collectively represent points, lines, and polygons along with their associated attributes. In a shaded map, for example, each region would be a polygon stored in the shapefile together with the attribute value that determines its shading.
Some of the files that make up a shapefile are .shp (geometry), .shx (index) and .dbf (attribute data).
Shapefiles are very useful for GIS-specific software like ArcGIS or QGIS, and they are typically very customizable in these environments. For example, it’s easy to change the projection or coordinate system of a shapefile in ArcGIS. For quick comparisons, it’s also easy to convert shapefiles to GeoJSON using tools like MyGeodata Converter.
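If you are curious what lives inside the .shp file, here is a small sketch that decodes its fixed 100-byte header with Python's `struct` module. The function name `parse_shp_header` and the synthetic header are assumptions for illustration, and the sketch reads only the header, not the geometry records. In practice you would use a GIS library instead of parsing bytes by hand.

```python
import struct

# A few of the shape type codes defined for the .shp header.
SHAPE_TYPES = {0: "Null", 1: "Point", 3: "PolyLine", 5: "Polygon", 8: "MultiPoint"}


def parse_shp_header(header):
    """Decode the fixed 100-byte header at the start of a .shp file."""
    if len(header) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # File code and file length are big-endian; the length counts 16-bit words.
    (file_code,) = struct.unpack(">i", header[0:4])
    (length_words,) = struct.unpack(">i", header[24:28])
    if file_code != 9994:
        raise ValueError("not a shapefile: unexpected file code")
    # Version, shape type, and bounding box are little-endian.
    version, shape_type = struct.unpack("<ii", header[28:36])
    xmin, ymin, xmax, ymax = struct.unpack("<4d", header[36:68])
    return {
        "version": version,
        "shape_type": SHAPE_TYPES.get(shape_type, f"other ({shape_type})"),
        "bbox": (xmin, ymin, xmax, ymax),
        "file_bytes": length_words * 2,
    }


# Build a synthetic header (made-up values) so the sketch runs without a file.
fake_header = (
    struct.pack(">i", 9994)          # file code
    + bytes(20)                      # five unused integers
    + struct.pack(">i", 50)          # file length in 16-bit words (100 bytes)
    + struct.pack("<ii", 1000, 5)    # version 1000, shape type 5 (Polygon)
    + struct.pack("<4d", -74.3, 40.5, -73.7, 40.9)  # xmin, ymin, xmax, ymax
    + bytes(32)                      # Z and M ranges, unused here
)
info = parse_shp_header(fake_header)
print(info)
```

With a real file you would pass the first 100 bytes instead, for example `open("regions.shp", "rb").read(100)`, where `regions.shp` is a hypothetical file name.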
KML
KML stands for Keyhole Markup Language. This data format was developed by Keyhole Inc. (which was eventually acquired by Google) and is commonly used in Google applications like Maps and Google Earth. Just like the other two formats, KML files store geometric data and attribute data and work great for creating interactive data maps. It’s also very easy to convert KML files to GeoJSON and Shapefiles using bash scripts or online tools like https://ogre.adc4gis.com/.
KML files have an XML-based file format. The key elements of the file’s hierarchy are:
- Root ‘kml’ element which basically organizes the geospatial data.
- Document element which holds components like styles, folders, and groups of geospatial elements.
- Placemark element which is used to define individual geographic features on a map. Each placemark also contains the attributes of a point, line, or polygon.
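The hierarchy above can be sketched in a few lines of Python. The KML snippet below is made up for illustration, and `placemarks_to_features` is a hypothetical helper that handles only Point placemarks. It walks the kml, Document, and Placemark elements with the standard `xml.etree.ElementTree` module and emits GeoJSON-style features.

```python
import xml.etree.ElementTree as ET

# KML 2.2 elements live in this XML namespace.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# A made-up KML document with a single Placemark, for illustration only.
kml_text = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Statue of Liberty</name>
      <Point>
        <coordinates>-74.044502,40.689247,0</coordinates>
      </Point>
    </Placemark>
  </Document>
</kml>
"""


def placemarks_to_features(kml_string):
    """Convert Point placemarks into GeoJSON-style Feature dictionaries."""
    root = ET.fromstring(kml_string)
    features = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default=None, namespaces=KML_NS)
        coords = placemark.findtext(
            "kml:Point/kml:coordinates", default=None, namespaces=KML_NS
        )
        if coords is None:
            continue  # not a Point placemark; skipped in this sketch
        # KML coordinates are "longitude,latitude[,altitude]".
        lon, lat = (float(value) for value in coords.strip().split(",")[:2])
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name},
        })
    return features


features = placemarks_to_features(kml_text)
print(features)
```

Both KML and GeoJSON put longitude before latitude, which makes this particular conversion pleasantly direct.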
Coordinate Systems
WGS84
WGS84 is short for World Geodetic System 1984. It’s pretty much the global standard for mapping and representing spatial information. If you can think of any technical, spatial application, like a GPS, there’s a good chance it’s probably using WGS84.
Locations are represented using latitude and longitude in the form of coordinates. For example:
Statue of Liberty, NYC:
- Latitude: 40.689247
- Longitude: -74.044502
These seem pretty easy to understand, which is great for us! On the backend, though, there’s a lot of complicated math behind calculating these coordinates. In general, there are two important things you need to understand when thinking about how WGS84 works:
- Geodetic Datum – which is the mathematical model that fixes things like origin, orientation, scale, and position for all coordinates. If you look at those coordinates and think "how far will I move if I change 40.7 to 40.8 in the latitude?", the datum is what lets you answer that question.
- Ellipsoid – which is a 3D shape that approximates the surface of the earth more accurately than a sphere. Because the Earth’s rotation flattens it slightly at the poles and bulges it at the equator, an ellipsoid works much better than a perfect sphere at mapping the surface.
Like I said, the math can get a little complicated, but if you’re interested in learning more, this video by Aviation Theory gives a detailed breakdown.
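To put a number on the "40.7 to 40.8" question, here is a rough sketch that uses the WGS84 ellipsoid's defining constants (semi-major axis and flattening). It approximates the north-south distance by multiplying the meridional radius of curvature at the midpoint latitude by the angle moved. The function name is my own, and the approximation is only meant for small steps like this one.

```python
import math

# WGS84 defining parameters.
WGS84_A = 6378137.0            # semi-major axis in meters
WGS84_F = 1 / 298.257223563    # flattening


def meters_per_latitude_step(lat_start, lat_end):
    """Approximate north-south distance (meters) between two nearby latitudes.

    Uses the meridional radius of curvature at the midpoint latitude,
    which is a reasonable approximation for small steps only.
    """
    e2 = WGS84_F * (2 - WGS84_F)  # first eccentricity squared
    mid = math.radians((lat_start + lat_end) / 2)
    radius = WGS84_A * (1 - e2) / (1 - e2 * math.sin(mid) ** 2) ** 1.5
    return radius * math.radians(abs(lat_end - lat_start))


distance = meters_per_latitude_step(40.7, 40.8)
print(f"{distance / 1000:.1f} km")
```

Under these assumptions, a tenth of a degree of latitude near New York works out to about 11.1 km. The same step in longitude would cover less ground, because lines of longitude converge toward the poles.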
UTM
UTM is short for Universal Transverse Mercator. It’s a global grid system that emphasizes local precision by dividing the Earth into zones. Each zone has its own set of local coordinates, which minimizes distortion within that zone. UTM coordinates look like this:
(580735.812, 4504700.604, Zone 18T)
- Easting: 580735.812
- Northing: 4504700.604
- UTM Zone: 18T
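As you can see, they look very different from WGS84 latitude and longitude: easting and northing are distances in meters within a zone rather than angles in degrees. The two systems serve different purposes and are used by many different applications. UTM works great for accurate, local mapping needs like urban planning or land surveying.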
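The zone part of a UTM coordinate can be worked out from latitude and longitude alone. Here is a small sketch with my own function names that uses the standard 6-degree-wide zones and 8-degree latitude bands. It deliberately ignores the special-case zones around Norway and Svalbard, and it does not compute easting or northing, which is a job for a projection library.

```python
import math

# Latitude band letters from 80°S to 84°N (I and O are skipped).
BAND_LETTERS = "CDEFGHJKLMNPQRSTUVWX"


def utm_zone(lat, lon):
    """Return (zone number, band letter) for a WGS84 latitude/longitude.

    Simplified: ignores the irregular zones around Norway and Svalbard.
    """
    if not -80.0 <= lat <= 84.0:
        raise ValueError("UTM is defined between 80°S and 84°N")
    # Zones are 6 degrees wide, numbered 1 to 60 starting at 180°W.
    zone_number = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1
    # Bands are 8 degrees tall; the last band (X) is stretched to 84°N.
    band_index = min(int(math.floor((lat + 80.0) / 8.0)), len(BAND_LETTERS) - 1)
    return zone_number, BAND_LETTERS[band_index]


def utm_epsg(lat, lon):
    """Return the EPSG code of the WGS84-based UTM zone for this location."""
    zone_number, _ = utm_zone(lat, lon)
    return (32600 if lat >= 0 else 32700) + zone_number


# Statue of Liberty coordinates from the WGS84 section.
print(utm_zone(40.689247, -74.044502))
print(utm_epsg(40.689247, -74.044502))
```

For the Statue of Liberty this gives zone 18 and band T, the same "18T" zone that appears in the example coordinates above.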
Working with Diverse Geospatial Datasets
Working with spatial datasets can be very frustrating, especially when you’re using data from different sources – all of which might be using different methods of spatial calculations that don’t align. These are some of the issues I’ve run into and how you can avoid them:
Different formats or systems: This is basically what this whole article is about. When working with geographic data, converters are your best friend because it is nearly impossible to analyze geographic datasets that use different formats. So when you’re working with multiple datasets, always double-check the following:
- Format – is it GeoJSON, Shapefile, KML or something else?
- Coordinate system – is it WGS84, UTM or something else?
- Software – is your software reading the data correctly? Geographic data can confuse some software, so if you run into any issues, check how your software is processing the data.
Do your future self a favor and pick a consistent format and coordinate system then convert all your data to that. This won’t always be the best option as you might sacrifice some precision, but in general, staying in the same formats/system will save you tons of time.
Processing and data cleaning: As you might already know, data cleaning is a pretty standard part of data analysis. A few things make it different for spatial data, though. In my experience, these are the things you need to look out for:
- Missing coordinates/outliers – For example: you’re working with data from the USA and there’s one random data point in France. Before analyzing the data, make sure there are no outliers. If any coordinate data is missing, you might be able to look it up with online tools like Google Maps!
- Projection mismatches – This happens all the time and will mess up your analysis if you don’t catch it while cleaning. In general, good datasets will have the datum and projection included as a field so you can easily compare and convert them. If these fields aren’t included, you should go directly to the source. Usually they’ll have a data dictionary or methodology overview that tells you what projection they’re using.
- Scale and precision – If you scroll up to the example coordinates for UTM and WGS84, you’ll see they both have a lot of decimal places. Those points are pretty precise, but not all datasets are going to be that precise. In general, just try to be consistent with the number of decimal places and as precise as you can be!
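The first check in that list is easy to automate. Here is a minimal sketch, assuming each record is a dictionary with `lat` and `lon` keys in WGS84 degrees. The record layout, the function name, and the rough bounding box for the contiguous USA are all my own assumptions for illustration.

```python
# Very rough bounding box for the contiguous USA (an illustrative assumption):
# (min_lon, min_lat, max_lon, max_lat)
USA_BBOX = (-125.0, 24.0, -66.5, 49.5)


def split_clean_and_suspect(records, bbox):
    """Separate usable records from ones with missing or implausible coordinates."""
    min_lon, min_lat, max_lon, max_lat = bbox
    clean, suspect = [], []
    for record in records:
        lat, lon = record.get("lat"), record.get("lon")
        if lat is None or lon is None:
            suspect.append((record["name"], "missing coordinate"))
        elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
            suspect.append((record["name"], "not a valid lat/lon"))
        elif not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            suspect.append((record["name"], "outside expected area"))
        else:
            clean.append(record)
    return clean, suspect


# Made-up sample data: one good point, one outlier in France, one gap.
records = [
    {"name": "Statue of Liberty", "lat": 40.689247, "lon": -74.044502},
    {"name": "Paris office", "lat": 48.8566, "lon": 2.3522},
    {"name": "Unknown site", "lat": None, "lon": -98.5},
]
clean, suspect = split_clean_and_suspect(records, USA_BBOX)
print(clean)
print(suspect)
```

A bounding box is a blunt tool, so treat flagged records as candidates for review rather than deleting them automatically.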
Conclusion
I hope this overview gave you a good idea of what it’s like to work with spatial data. As you can see, it gets pretty complicated, but at the same time the systems we have set up work very well! As long as you know what formats/systems you’re working with, you should have an awesome experience working with geospatial data!
Thank you for reading!