Geoparsing with Python

Mining text for geographical place-names and plotting them

Michael Eby
Towards Data Science

--

“The Quaker City in a Storm,” frontispiece from Mark Twain’s The Innocents Abroad; Source: Project Gutenberg

Geoparsing refers to the process of extracting place-names from text and matching those names unambiguously with proper nouns and spatial coordinates. These coordinates can then be plotted on a map in order to visualize the spatial footprint of the text in question. Geoparsing is a specific kind of procedure known in geography as toponym resolution: however, while both geoparsing and toponym resolution address the identification of concrete geographical entities in text, toponym resolution typically concerns the simple derivation of geography from structured, unambiguous references; an example is the post office’s use of geocoded state abbreviations for routing. Geoparsing, on the other hand, indicates that the input text contains unstructured ambiguity; for example, a text may reference Georgia, which is both a country and a state in the southern USA, and it is the task of the geoparsing researcher to resolve that ambiguity.

In information science, literary analysis, and the digital humanities, the Edinburgh Geoparser has been in use since 2015 as an open-source tool for geoparsing historical documents and classical works. Available free online, the Edinburgh Geoparser allows researchers to follow the peregrinations of John Steinbeck’s Travels with Charley or George Orwell’s The Road to Wigan Pier not solely on paper, but on a computer screen. The program takes an input text, pinpoints location names, cross-references those names with a gazetteer — an atlas-like geographical directory — and plots the output on Google Maps. In this way, scholars can see the travels of a letter sender or a novel’s protagonist from a God’s-eye view, each discrete destination in interactive panorama.

With the aid of just a handful of readily accessible packages, we can easily build our own rudimentary Edinburgh Geoparser in Python. What follows is an attempt at doing so. For this example, I will be parsing Mark Twain’s The Innocents Abroad. This is an apt text for the task: published in 1869, the book was one of Twain’s most famous in his lifetime, and it remains one of the best-selling travelogues ever. From the United States to France, Odessa to Jerusalem, Twain and 60 compatriots from 15 different states traverse the globe, toponyms abounding throughout.

For this geoparser, the packages below will be needed. I will make clear what each one is doing as it is called in the script.

>import numpy as np
>import matplotlib.pyplot as plt
>%matplotlib inline
>
>import pandas as pd
>import geopandas as gpd
>
>from urllib import request
>from geotext import GeoText
>
>from geopy.geocoders import Nominatim
>from geopy.exc import GeocoderTimedOut
>
>from shapely.geometry import Point, Polygon
>import descartes

The first thing to do is to grab The Innocents. The full text is available free online through Project Gutenberg. Python’s natural language processing library NLTK natively provides 18 Gutenberg texts for practice. Unfortunately, The Innocents is not one of them (a quick check below confirms this), so I will use the request module from urllib, pass the URL for the raw .txt file, and decode it as UTF-8.
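First, that sanity check: NLTK’s bundled Gutenberg corpus can be listed directly (a minimal sketch, assuming the corpus has already been fetched with nltk.download('gutenberg')):

>from nltk.corpus import gutenberg
>
>## the 18 bundled texts -- Austen, Blake, Shakespeare, et al., but no Twain
>print(gutenberg.fileids())

With that confirmed, on to downloading the raw text: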

>url = "http://www.gutenberg.org/files/3176/3176-0.txt"
>response = request.urlopen(url)
>raw = response.read().decode('utf8')
>print(f'{type(raw)}, \n{len(raw)}, \n{raw[:501]}')
<class 'str'>,
1145397,
Project Gutenberg's The Innocents Abroad, by Mark Twain (Samuel Clemens)

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: The Innocents Abroad

Author: Mark Twain (Samuel Clemens)

Release Date: August 18, 2006 [EBook #3176]
Last Updated: February 23, 2018

Language: English

Printed in my Jupyter Notebook console is the book’s title page; the entire text is just one large string of 1,145,397 characters. This string contains all the geographical references that I will attempt to mine. The automatic processing pipeline of the NLP package spaCy includes named entity recognition, assigning countries, states, and cities the label “geopolitical entities.” However, I was unable to find a simple way in spaCy to extract just cities—save for writing an iterative function that cross-references Wikipedia summaries—so I am instead using GeoText, which has an attribute for easily extracting city names.
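To see that limitation concretely, here is a minimal spaCy sketch (assuming the en_core_web_sm model is installed; note that spaCy caps input at 1,000,000 characters by default, so the full string would need to be chunked):

>import spacy
>
>nlp = spacy.load('en_core_web_sm')
>doc = nlp(raw[:100000])  ## a slice, since nlp.max_length defaults to 1,000,000
>## GPE lumps countries, states, and cities together -- no city-only filter
>gpes = [ent.text for ent in doc.ents if ent.label_ == 'GPE']

GeoText, by contrast, separates cities out directly: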

>places = GeoText(raw)
>cities = list(places.cities)
>cities
['Tangier',
'Paris',
'Temple',
'Como',
'Garibaldi',
'Rome',
'Roman',
'Naples',
'Naples',
...]

Like spaCy, GeoText recognizes geographically named entities, but calling .cities in GeoText returns just the city names. Note that I am not removing duplicate entries of cities; I want to keep these so that when I plot them, my map reflects frequencies of reference.
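Those frequencies can be previewed before plotting with the standard library’s Counter (an optional sketch):

>from collections import Counter
>
>## most frequently referenced city names, duplicates intact
>Counter(cities).most_common(10)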

Now I need a gazetteer, in order to concretely ground these city names with unambiguous spatial coordinates. The Edinburgh Geoparser contains specialized gazetteers for historical place names, but I don’t have access to those. Instead, I will use OpenStreetMap, a collaborative map inspired by the Wikipedia model: it is free and editable by anyone. OSM has a search tool called Nominatim, which, given a city or state name, geocodes that name and returns its geographical coordinates. So I just need to give Nominatim the list of city names. There are many ways to do this, but one way is through the Python package GeoPy, which abstracts Nominatim’s API and returns the coordinates contained in the first result of the input search.

>## user_agent can be any identifying string; recent geopy releases require one
>geolocator = Nominatim(user_agent='innocents-geoparser', timeout=2)
>
>lat_lon = []
>for city in cities:
>    try:
>        location = geolocator.geocode(city)
>        if location:
>            print(location.latitude, location.longitude)
>            lat_lon.append(location)
>    except GeocoderTimedOut as e:
>        print("Error: geocode failed on input %s with message %s" %
>              (city, e))
>lat_lon
35.7642313 -5.81862599789511
48.8566101 2.3514992
31.098207 -97.3427847
45.9394759 9.14941014540895
-29.2562253 -51.5269167
41.894802 12.4853384
49.67887 4.31419
40.8359336 14.2487826
40.8359336 14.2487826

I set the API’s timeout to 2 seconds, and included an error-handling statement that skips the current search if the API times out. My hope is that if an individual search takes longer than 2 seconds, it’s not really a city name, and the script will thus weed out non-names from the final map. I then put the city names and coordinates into a pandas dataframe; since each geopy result unpacks as an (address, (latitude, longitude)) pair, the DataFrame constructor splits them into two columns:

>df = pd.DataFrame(lat_lon, columns=['City Name', 'Coordinates'])
>df.head(7)

The coordinates, which are currently formatted as tuples, need to be transformed into point objects, as each city will be represented by a point on the map. The Python package Shapely does just that. Below, I’m iterating over the dataframe series that contains the coordinate tuples, turning each into a Point, and switching the order of latitude and longitude, because Shapely points are (x, y) pairs, that is, (longitude, latitude), which is also the order the map object I downloaded uses.

>geometry = [Point(x[1], x[0]) for x in df['Coordinates']]
>geometry[:7]
[<shapely.geometry.point.Point at 0x116fe3b10>,
<shapely.geometry.point.Point at 0x116ff9190>,
<shapely.geometry.point.Point at 0x116fe0c10>,
<shapely.geometry.point.Point at 0x116fe0a10>,
<shapely.geometry.point.Point at 0x116fe0250>,
<shapely.geometry.point.Point at 0x116fe0850>,
<shapely.geometry.point.Point at 0x116fe0210>]
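
A quick sanity check confirms the order: x is now longitude and y latitude, with the first point being Tangier from the top of the geocoding output above.

>print(geometry[0].x, geometry[0].y)
>## -5.81862599789511 35.7642313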

Putting these point objects back into a dataframe will allow us to plot them easily. Now, instead of a pandas dataframe, I will use a geopandas dataframe, which can easily handle these types of objects.

>## coordinate reference system: WGS84 lat/lon
>crs = 'EPSG:4326'  ## newer geopandas syntax; older releases used {'init': 'epsg:4326'}
>
>## convert df to geo df
>geo_df = gpd.GeoDataFrame(df, crs=crs, geometry=geometry)
>geo_df.head()

The final step is to plot the map with the locations. The fact that I did not remove duplicate location entries means that, if I set the transparency of the markers, more frequently referenced locations will appear more opaque than those referenced just once. That’s being done with the alpha parameter in the plot object. Opaque markers should roughly correspond to cities where the Quaker City actually travelled, transparent ones to cities merely mentioned.

>## world map .shp file I downloaded
>countries_map = gpd.read_file('Countries_WGS84/Countries_WGS84.shp')
>
>f, ax = plt.subplots(figsize=(16, 16))
>countries_map.plot(ax=ax, alpha=0.4, color='grey')
>geo_df['geometry'].plot(ax=ax, markersize=30,
>                        color='b', marker='^', alpha=.2)

And that’s Mark Twain’s The Innocents Abroad, geoparsed.
