Geocoding on the Fly: Cleaning the Data

Melissa Anthony
Towards Data Science
3 min readFeb 9, 2021

--

Geocoding is a basic element of mapping. This is taking a list of addresses and giving them an XY coordinate for location. This XY coordinate allows you to add symbology classes to reporting like these examples:

Photo from personal projects by the Author

There are a few steps to get there. The first is to clean the dataset. Throughout my career, I have noticed the importance of consistency. Even if something was cataloged incorrectly, it is important that it was done consistently incorrect. So understanding how your data is structured is the number one priority. You begin by going through a checklist of:

  • How many rows and columns are in my dataset?
  • What are the data types of the columns? And does that make sense for the data?
  • Display the first 5 or 10 rows of data, what does the data look like?
  • Whatever language you use, do some value counts for a better idea of how the data is structured in each column.

Now I will provide a python script that I use when I am given a datasheet of service addresses. Then I will go over what the lines of code mean. First I am creating a dataframe that I will work on to place my geocoding information. This can be concatenated later to the original dataframe, or it can be saved as an additional sheet in the original excel file.

The first line is creating the Dataframe with the columns to be calculated. I then add an address field that will make combining the house number and road easier for a final clean address field.

This script is used to break a single column of addresses into multiple columns of data as a temporary dataset that can be appended to the temporary dataframe.

Code created by the Author

First I am looping through each Service Address row and using the tqdm library to provide a progress bar. I know the service address column is a string object, so I am splitting it up, separated by commas, and putting the values into a list that is unique to the row. Then I use list comprehension to strip all leading and trailing empty spaces from each value.

On the next set of if and elif statements, I am looking for the length of the values in the string. Since I know that the service address is written in one of the following formats:

  • 1234 Main Street, All Places City, State, 8000–000
  • 1234 Main Street, All Places City State, 8000–000

For each statement, I define the variables of house number, road name, city, state, and zip code by pulling each string variable from the list and breaking it down appropriately.

This last bit of the code compiles all of the defined variables as a temporary series with the index matching the column names. Then appends that as a row into the temporary cust_add_db. One line of code you can add to the dataframe is a unique index number that will help concatenate this back to the original.

In my next article, I will talk about code that can help find and remove apartment numbers, units, etc. These can confuse open source geocoding services.

--

--