
The vast amount of text data now available carries location information that can be extracted automatically. Natural Language Processing (NLP) has advanced significantly in the last five years. However, geographic information extraction from text is still in its infancy. In this tutorial, we use Python and NLP to geoparse a Twitter dataset.
Geoparsing
Geoparsing is the process of converting free-text descriptions of places (such as "Two kilometres east of London") into geographic identifiers (coordinates with latitude and longitude).
Geoparsing is essential in many applications, including Geographic Information Retrieval (GIR), Geographic Information Extraction (GIE) and Geographic Information Analysis (GIA) tasks.
We can use geoparsing to determine a document’s geographic scope and to decode locational information for disaster response, business news analysis, and many other domains.
To illustrate what geoparsing is, let us consider this satirical headline:
"Protesters Steal NYC Sanitation Trucks, Use Them To Block Trump Tower"
Typically, geoparsing consists of two components: toponym recognition and toponym resolution. The first step is toponym recognition (or extraction), which here yields [NYC, Trump Tower]. The second step is toponym resolution, which links each toponym to geographic coordinates: [(40.768121, -73.981895), (40.762347, -73.973848)].
In the next section, we geoparse a simple text with Mordecai, a Python geoparsing library.
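To make the two steps concrete, here is a minimal sketch that uses a toy gazetteer instead of a real NLP model. The names GAZETTEER, recognize_toponyms and resolve_toponyms are made up for this illustration, and the simple substring matching only stands in for proper named entity recognition.

```python
# Toy gazetteer: place name -> (latitude, longitude).
# The two entries reuse the coordinates quoted above; a real gazetteer
# such as GeoNames holds millions of entries.
GAZETTEER = {
    "NYC": (40.768121, -73.981895),
    "Trump Tower": (40.762347, -73.973848),
}


def recognize_toponyms(text, gazetteer):
    """Step 1 (recognition): find gazetteer names that occur in the text."""
    found = []
    for name in gazetteer:
        start = text.find(name)
        if start != -1:
            found.append((start, name))
    # Report toponyms in the order in which they appear in the text.
    return [name for _, name in sorted(found)]


def resolve_toponyms(toponyms, gazetteer):
    """Step 2 (resolution): link each toponym to its coordinates."""
    return {name: gazetteer[name] for name in toponyms}


headline = "Protesters Steal NYC Sanitation Trucks, Use Them To Block Trump Tower"
toponyms = recognize_toponyms(headline, GAZETTEER)
coordinates = resolve_toponyms(toponyms, GAZETTEER)
print(toponyms)     # ['NYC', 'Trump Tower']
print(coordinates)
```

A real geoparser has to do much more in both steps, for example deciding which of several places with the same name is meant, which is where a library such as Mordecai comes in.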
Geoparsing Example with Python
For this tutorial, we will use the Mordecai library. Mordecai is a full-text geoparsing library for Python. With it, you can extract the place names from a piece of text, resolve them to the correct place, and return their coordinates and structured geographic information.
Let us start with a simple example. Mordecai's Geoparser class has a geoparse method that takes a text and returns structured geographic information about the places it mentions.
from mordecai import Geoparser
geo = Geoparser()
geo.geoparse("Eiffel Tower is located in Paris")
For any text input, Mordecai returns the place names it finds in the text. In this example, it correctly identifies both the Eiffel Tower and the city of Paris. Note that the latitude and longitude of the two locations differ: the coordinates predicted for the Eiffel Tower are more specific than those for the city of Paris.
[{'word': 'Eiffel Tower',
'spans': [{'start': 0, 'end': 12}],
'country_predicted': 'FRA',
'country_conf': 0.611725,
'geo': {'admin1': 'Île-de-France',
'lat': '48.85832',
'lon': '2.29452',
'country_code3': 'FRA',
'geonameid': '6254976',
'place_name': 'Tour Eiffel',
'feature_class': 'S',
'feature_code': 'MNMT'}},
{'word': 'Paris',
'spans': [{'start': 27, 'end': 32}],
'country_predicted': 'FRA',
'country_conf': 0.9881995,
'geo': {'admin1': 'Île-de-France',
'lat': '48.85339',
'lon': '2.34864',
'country_code3': 'FRA',
'geonameid': '2988506',
'place_name': 'Paris',
'feature_class': 'A',
'feature_code': 'ADM3'}}]
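Note that the coordinates come back as strings. A small helper along the following lines could turn the result into numeric points. This is only a sketch: to_points is a made-up name, the result list is an abridged copy of the output above, and the handling of entries without a 'geo' key is an assumption about how unresolved place names are returned.

```python
# Abridged copy of the result shown above (only the keys used below).
result = [
    {"word": "Eiffel Tower", "country_predicted": "FRA", "country_conf": 0.611725,
     "geo": {"lat": "48.85832", "lon": "2.29452", "place_name": "Tour Eiffel"}},
    {"word": "Paris", "country_predicted": "FRA", "country_conf": 0.9881995,
     "geo": {"lat": "48.85339", "lon": "2.34864", "place_name": "Paris"}},
]


def to_points(entries):
    """Turn geoparse entries into (place_name, lat, lon) tuples with float coordinates."""
    points = []
    for entry in entries:
        geo = entry.get("geo")
        if geo is None:  # assumption: unresolved place names carry no "geo" key
            continue
        points.append((geo["place_name"], float(geo["lat"]), float(geo["lon"])))
    return points


points = to_points(result)
for point in points:
    print(point)
```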
Mordecai takes several steps to achieve this result. First, it uses spaCy's named entity recognition to extract place names from the text. Then, it queries the GeoNames gazetteer for candidate entries, with their coordinates, for each place name. Finally, it uses neural networks to predict the country and to select the best-matching gazetteer entry.
Geoparsing Tweets
To geoparse tweets, we first set up the Tweepy API to collect tweets for a hashtag. The following piece of code takes a hashtag (#BlackLivesMatter), searches for it with Tweepy, and saves all matching tweets to a local CSV file.
scraping tweets with Tweepy
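As a rough sketch of the saving step, the helper below writes one (date, text) row per tweet to a CSV file without a header row, which is the layout read in the next step. The name save_tweets is made up for this illustration, and the stand-in tweets are placeholders, not real data. The Tweepy call appears only in a comment and assumes Tweepy 4.x with an authenticated API object.

```python
import csv
import os
import tempfile
from types import SimpleNamespace


def save_tweets(tweets, path):
    """Write one (date, text) row per tweet to a CSV file without a header row."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for tweet in tweets:
            writer.writerow([tweet.created_at, tweet.text])
            count += 1
    return count


# With Tweepy, the iterable could come from a search cursor, for example
# (assumption: Tweepy 4.x and an authenticated `api = tweepy.API(auth)`):
#     tweets = tweepy.Cursor(api.search_tweets, q="#BlackLivesMatter").items(500)
# Placeholder objects with the same two attributes keep this sketch offline.
sample = [
    SimpleNamespace(created_at="2020-06-01 12:00:00", text="Placeholder tweet mentioning Paris"),
    SimpleNamespace(created_at="2020-06-01 12:05:00", text="Placeholder tweet, with a comma"),
]

# A temporary folder is used here; the tutorial itself writes to tweets.csv.
csv_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
saved = save_tweets(sample, csv_path)
print(saved, "tweets written to", csv_path)
```

Using the csv module rather than joining strings by hand keeps tweets that contain commas or quotes readable by pd.read_csv.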
Let us read the tweets CSV with Pandas and look at the first few rows.
import pandas as pd
df = pd.read_csv("tweets.csv", header=None, names=["date", "tweet"])
df.head()

The data frame now holds the date and the text of each tweet. Let us use Mordecai to extract locational information and assign coordinates. We set up a function that takes the data frame and returns a clean data frame with the additional locational information from geoparsing.
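One possible shape for such a function is sketched below. It is not the original implementation: geoparse_rows and fake_geoparse are made-up names, the function works on plain (date, tweet) pairs so that it runs without Pandas or Mordecai installed, and skipping entries without a 'geo' key is an assumption about how unresolved place names are returned.

```python
def geoparse_rows(rows, geoparse):
    """Return one flat record per resolved place name found in the tweets.

    rows     -- iterable of (date, tweet) pairs
    geoparse -- callable returning Mordecai-style entries for a text,
                for example Geoparser().geoparse
    """
    records = []
    for date, tweet in rows:
        for entry in geoparse(tweet):
            geo = entry.get("geo")
            if not geo:  # skip place names that were not resolved to coordinates
                continue
            records.append({
                "date": date,
                "tweet": tweet,
                "word": entry["word"],
                "country_predicted": entry["country_predicted"],
                "country_conf": entry["country_conf"],
                "place_name": geo["place_name"],
                "lat": float(geo["lat"]),  # Mordecai returns coordinates as strings
                "lon": float(geo["lon"]),
            })
    return records


def fake_geoparse(text):
    """Offline stand-in that mimics the shape of Mordecai's output."""
    if "Paris" not in text:
        return []
    return [{"word": "Paris", "country_predicted": "FRA", "country_conf": 0.9881995,
             "geo": {"lat": "48.85339", "lon": "2.34864", "place_name": "Paris"}}]


# Placeholder rows, not real tweets.
rows = [("2020-06-01", "Marching in Paris today"), ("2020-06-01", "No place here")]
records = geoparse_rows(rows, fake_geoparse)
print(records)
```

With the real libraries, the same function could be called as pd.DataFrame(geoparse_rows(df.itertuples(index=False), geo.geoparse)) to obtain a data frame with numeric lat and lon columns.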
Our clean dataset now contains the place names extracted from each tweet's text, the coordinates assigned to each place name, and the predicted country with the confidence of that prediction.

To plot the geographic extent of the #BlackLivesMatter hashtag extracted with Mordecai, you can use any of your favourite geospatial data visualization libraries for Python. I am using Plotly Express to plot the data.
import plotly.express as px
# The "dark" style is served by Mapbox, so an access token must be set first, e.g. with px.set_mapbox_access_token(...).
fig = px.scatter_mapbox(df_clean, lat="lat", lon="lon", size_max=15, zoom=1, width=1000, height=800, mapbox_style="dark")
fig.data[0].marker = dict(size = 5, color="red")
fig

Conclusion
Geoparsing is an essential component in automating the extraction of location features from text. In this tutorial, we have seen how to geoparse text using the Mordecai geoparsing library for Python.
To run geoparsing with Mordecai, you need to install it. You also need a running Docker container, which serves the GeoNames index that Mordecai queries. You can find the installation instructions here.
The code for this tutorial is available in this Jupyter Notebook.