
Named Entity Recognition and Geocoding with R

A quick guide to finding and geolocating place names in historic (or contemporary) texts


The digital humanities make broad use of technology, including GIS applications, to analyze historic (and contemporary) works. In addition to open-source software, open data makes this kind of analysis accessible to more analysts.

In this post I will provide a brief guide to named entity recognition (NER) and geocoding in R for digital humanities applications. I will use open data from Project Gutenberg and OpenStreetMap (OSM), along with open-source packages from the excellent R community.

This guide requires that you have some familiarity with R programming.

What is Named Entity Recognition?

NER is a natural language processing task that identifies and tags entities within a text. These entities can be people, dates, companies, or, in this case, locations. R has several great packages for natural language processing, including openNLP and spacyr.

What is geocoding?

Geocoding is the process of finding spatial coordinates for locations. With R it is possible to geocode programmatically, but the process is imperfect and requires that extra attention be paid to the results.

What is the process?

I have broken this process into five steps: getting the data, cleaning the data, annotating the data, geocoding, and evaluating the data. The process is iterative. Because we are working with text, it is easy to miss details and make mistakes. Programming provides a very useful tool, but close reading and familiarity with the text are essential.

Setting up

For this project I will use the book Through the Casentino: With Hints for the Traveler by Lina Eckenstein, downloaded from Project Gutenberg using the gutenbergr package.

Step 0: Load packages
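A minimal setup might look like the following; the entity package may need to be installed from GitHub (trinker/entity), and spacyr additionally needs a working spaCy install the first time it is used.

```r
library(gutenbergr)   # download Project Gutenberg texts
library(dplyr)        # data manipulation
library(tibble)       # tidy data frames
library(stringr)      # string helpers
library(tidyr)        # unnesting list columns
library(readr)        # reading/writing CSVs
library(spacyr)       # sentence tokenization (wraps Python's spaCy)
library(entity)       # openNLP-based named entity annotators
library(tidygeocoder) # geocoding against OSM/Nominatim
library(sf)           # spatial objects and shapefile export
library(ggplot2)      # plotting
```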

Step 1: Get the data

Downloading the book

Downloading books with gutenbergr is quite simple: find the book's Gutenberg ID and pass it to gutenberg_download().

(Note: I have had some trouble in the past with the default download mirror, which can be fixed by choosing an alternative mirror. You can see an explanation of the error in this thread.) The result is a tidy data frame with one line of text per row, which is really useful for analysis with packages such as tidytext and quanteda.
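A sketch of the download step; the title search and the alternative mirror URL are illustrative rather than the exact values used.

```r
# Find the book's Gutenberg ID by searching the metadata, then download it.
# The mirror argument is optional; the URL below is one alternative mirror.
casentino_id <- gutenberg_works(str_detect(title, "Casentino")) %>%
  pull(gutenberg_id)

book <- gutenberg_download(
  casentino_id,
  mirror = "http://mirrors.xmission.com/gutenberg/"
)

head(book)  # one line of text per row
```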

Step 2: Clean and prepare the data

Now that you have the data, you need to do some tidying. Books from Project Gutenberg include front matter, such as the title and publication information. This is not useful for the analysis, so you can discard everything before the start of the actual text. You can also remove certain characters or edit strings within the text as needed.
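Something along these lines works; the cutoff line and string fixes below are placeholders, so check where the body of your edition actually begins.

```r
# Drop the front matter and make small string fixes. The start line is an
# assumption; inspect your copy to find where the text proper begins.
book_clean <- book %>%
  mutate(line = row_number()) %>%
  filter(line >= 50) %>%                          # placeholder start line
  mutate(text = str_replace_all(text, "_", ""))   # e.g. strip italics markers
```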

Tokenize the data

An important part of textual analysis is tokenization. This is the programmatic process of breaking the text down into smaller sections, usually words, lines, or sentences. The data is currently tokenized into lines, but I would like to create sentence tokens for use with the NER annotator.

There are a number of different packages with built-in tokenizers. The word tokenizer built into the tidytext package works well, but sentence tokenization with tidytext does not always produce the best results, so I will use spacyr instead. spacyr is a wrapper for the Python package spaCy. It is a powerful package for natural language processing, but it requires that you have Python installed, and it needs to connect to a Python session from within R.

The spacyr tokenizer requires a text string or corpus, which means the dataframe has to be collapsed so that the book is a single character string. After that, we can reformat the resulting sentences into a dataframe with one sentence per row.

(Note: You can keep your data in the current format returned by gutenbergr. The results may differ, but the principles are the same.)
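A rough version of the collapse-and-tokenize step, assuming spaCy's small English model is installed:

```r
# Start a spaCy session (requires Python + spaCy), collapse the book into a
# single string, and split it into sentences.
spacy_initialize(model = "en_core_web_sm")

book_string <- paste(book_clean$text, collapse = " ")

sentences <- spacy_tokenize(book_string, what = "sentence") %>%
  unlist() %>%
  tibble(text = .) %>%
  mutate(sentence_id = row_number())

spacy_finalize()
```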

Retrieve locations

Next, we will retrieve the locations from the text. There are a number of entity taggers available in R, but I will be using the entity package, which wraps the openNLP annotators. I prefer this tool because it is not too RAM-intensive and is a process my computer can handle.

You can return the result in any format you like, but I have chosen to append a new list column to the dataframe where locations are found in the text. The function returns a list of lists, meaning that we still have one row for every sentence, and a list entry holding multiple locations when more than one occurs in a sentence.
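In code, this step can be as simple as a single mutate() call; the column name is my own.

```r
# Tag locations in each sentence. location_entity() returns one element per
# sentence (NULL where nothing is found), which we keep as a list column.
sentences <- sentences %>%
  mutate(locations = location_entity(text))
```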

I also want to create a separate dataframe of un-nested locations, which is a little more readable. Here we select the locations column and un-nest it, giving one observation for every instance of a location in the text and removing the NULL objects from the list. The resulting 'locations' dataframe returns every location found in the text along with its row number. I will also group and count the locations, giving one row per location and the number of times it occurs in the text. This will be useful for geocoding and visualizing the data later on.
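A sketch of the un-nesting and counting, using the column names from above:

```r
# One row per location mention, keeping sentence_id so each hit can be read
# in context; NULL entries are dropped by unnest().
locations <- sentences %>%
  select(sentence_id, locations) %>%
  unnest(locations)

# One row per distinct location, with its frequency in the text.
locs_dist <- locations %>%
  count(locations, sort = TRUE)
```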

At this point it is necessary to do a preliminary check of the results. The entity locator may find incomplete or wrong entities. By keeping a location column alongside the tokenized sentences, we can try to pinpoint what these partial matches are referring to.

In this case, the entity locator found quite a few locations, but if we check them against the text, we find that, in addition to the incomplete entities, it has missed some instances of locations that it found elsewhere in the text. It has also tagged entities that aren't actually locations.

By keeping the location instance with the row number, you can easily check the particular sentence to infer what the entity refers to, and see the entity in context. This iteration is an important part of making sure that the data is accurate.

Cleaning the locations data

After reviewing the results alongside the text, you can remove the entities that do not refer to any location. I have also noticed a missing location, "Soci", in the text. I will add a row to the locations dataframe and then update the locs_dist dataframe.
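For illustration, the clean-up might look like this; the entities removed below are placeholders for whatever your own review turns up.

```r
# Drop tags that are not real places (placeholders here), add the missed
# location "Soci", and rebuild the counts.
not_places <- c("Dante", "Paradiso")   # assumed false positives

locations <- locations %>%
  filter(!locations %in% not_places) %>%
  add_row(locations = "Soci")

locs_dist <- locations %>%
  count(locations, sort = TRUE)
```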

Now that I have a list of locations that occur at least once in the text, I can also check whether there are any occurrences that the annotator has missed. This can be done in a number of ways, but I have included a function that simplifies the process (sketched below). It takes a dataframe, a vector of locations to look for, and the name of the column containing the text; it searches the entire text column for the locations in the vector and appends the matches as a new list column before returning the dataframe.
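A sketch of such a helper; the function and column names are my own, not the original code.

```r
# Re-scan every sentence for a known vector of locations and append the
# matches as a list column called new_locs.
find_locations <- function(df, locs, text_col) {
  text <- df[[text_col]]
  df$new_locs <- lapply(text, function(s) {
    hits <- locs[str_detect(s, fixed(locs))]
    if (length(hits) == 0) NULL else hits
  })
  df
}

sentences <- find_locations(sentences, unique(locations$locations), "text")
```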

Now we can see the original locations column from the dataframe side by side with the new new_locs column. If we print some example rows, we see that the function has picked out instances of specific locations that were missed by the original annotator. At this point, the initial locations column is no longer needed, so we can drop it and unnest the new_locs column. This gives us a dataframe with one row per location observation, meaning that some rows of text will be duplicated.
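Dropping the first-pass column and un-nesting might look like this:

```r
# Keep only the re-scanned matches and expand to one row per location
# observation (sentences with several locations are duplicated).
sentences_locs <- sentences %>%
  select(-locations) %>%
  unnest(new_locs)
```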

Step 3: Annotate the data

We have now found locations within the text, but we do not know the context in which they are being used. We also do not know if this is a complete list of all locations. At this point, we need to do a closer reading.

For this project I am interested in recreating the route the author took, which means I need to know which locations she actually visited. If I open the data in Excel, I can scan the text for the context of each location mention and code each observation. I might add supplemental data such as whether the location is a city, a site, or a geographic feature. I can also add a column indicating whether a location was visited, glimpsed, or not relevant to the analysis, and another for the order in which locations were visited. This is made easier by having the locations appear in a separate column alongside where they occur in the text.

Now we can load the data back into R. We can also create a separate dataframe that includes only the relevant locations for the next step: geocoding.
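A possible round trip through a spreadsheet; the file names and the status/visit_order columns are assumptions about how the data is coded by hand.

```r
# Write the observations out for hand coding, then read the coded file back
# in and keep only the rows relevant to the route.
write_csv(sentences_locs, "casentino_locations.csv")

coded <- read_csv("casentino_locations_coded.csv")   # after coding in Excel

relevant_locs <- coded %>%
  filter(status %in% c("visited", "viewed")) %>%
  select(sentence_id, new_locs, status, visit_order)
```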

Step 4: Geocode

We now have a filtered dataframe with all the locations we are interested in for this analysis: points visited or viewed. We can geocode these locations using the tidygeocoder package, which queries OSM data to geolocate sites. To improve accuracy, we can add a country column to pass to the geocoder.
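A sketch of the geocoding call, treating each location name as the city field and constraining matches to Italy:

```r
# Geocode against OSM/Nominatim via tidygeocoder; lon/lat columns are added
# to the dataframe.
geocoded <- relevant_locs %>%
  mutate(country = "Italy") %>%
  geocode(city = new_locs, country = country, method = "osm",
          lat = lat, long = lon)
```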

Step 5: Check the data and iterate

The geocoder didn’t return results for all of the points, but it returned quite a few. However, we need to check the accuracy of the geocode results. By plotting them along with a map of Italy we see that the geocoder did a pretty good job, but there are some points in Northern Italy that may be misidentified. It helps to be familiar with the text to know the range that your points should fall in. It also helps to join the coordinates dataframe with the full sentences dataframe, so we can have everything together.

Let’s plot everything on a map.
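For example, with an Italy outline from rnaturalearth (an assumption about the basemap source):

```r
# Plot the raw geocoding results over an outline of Italy.
library(rnaturalearth)

italy <- ne_countries(scale = "medium", country = "Italy", returnclass = "sf")

ggplot() +
  geom_sf(data = italy) +
  geom_point(data = filter(geocoded, !is.na(lat)),
             aes(x = lon, y = lat), colour = "red") +
  theme_minimal()
```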

Let’s also join the dataframes so we can see the results next to the full text.
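Joining on the sentence id keeps each point next to the sentence it came from:

```r
# Attach the full sentence text to each geocoded point for review.
geocoded_text <- geocoded %>%
  left_join(select(sentences, sentence_id, text), by = "sentence_id")
```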

One point in question is the location Madonna del Sasso. In context, we can see that it refers to a church near the river Corsalone. We can check an online map to verify the location and get the proper coordinates; it turns out the geocoder did not give us the place we were looking for. Tyler Morgan-Wall has written a function that simply reassigns coordinates within a dataframe that has separate columns for latitude and longitude.
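Here is a sketch of a helper in the same spirit; this is my own illustration rather than Tyler Morgan-Wall's function, and the coordinates below are placeholders, not verified values.

```r
# Overwrite the coordinates of a named location with manually checked values.
fix_coords <- function(df, location, new_lat, new_lon) {
  df %>%
    mutate(lat = ifelse(new_locs == location, new_lat, lat),
           lon = ifelse(new_locs == location, new_lon, lon))
}

# Placeholder coordinates; look the real ones up on an online map.
geocoded <- fix_coords(geocoded, "Madonna del Sasso", 43.75, 11.82)
```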

After finding all the correct coordinates, we can update the locations data by joining the coordinates with the rest of the data and creating a spatial object with the st_as_sf function, using the lon and lat columns as the X and Y coordinates.
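With sf, that conversion is a single call once rows without coordinates are dropped:

```r
# Convert to a simple features object using lon/lat as X/Y in WGS84.
locs_sf <- geocoded %>%
  filter(!is.na(lat), !is.na(lon)) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326)
```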

Now we can plot our spatial objects together.
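Both layers can now go through geom_sf:

```r
ggplot() +
  geom_sf(data = italy) +
  geom_sf(data = locs_sf, colour = "red") +
  theme_minimal()
```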

Finally, I will filter out the points I will be using for my own analysis, particularly the sites visited. I will use this data in QGIS to run a least-cost path analysis to estimate the route the author would have taken during her trip. I plan on documenting those steps in a future post. For now, I will create my filtered dataframe and save it as a shapefile.
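Filtering and writing the shapefile might look like this, assuming the status column from the hand-coding step:

```r
# Keep only the sites actually visited and write them out for QGIS.
visited_sf <- locs_sf %>%
  filter(status == "visited")

st_write(visited_sf, "casentino_visited.shp")
```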

Conclusion

R is a great tool for NLP, GIS, and textual analysis. Combined with the NLP power of the spaCy Python package, R can be used to locate geographical entities within a text and geocode the results. This is a helpful approach in digital humanities research, as well as HGIS. I hope this guide was useful in illustrating an iterative process for combining textual analysis, NER, and GIS. Please comment if you have any ideas for making this process smoother, or for writing better code.
