Traveling tourist

Part 1: Import WikiData to Neo4j with Neosemantics library

Import data from WikiData and Open Street Map API to create a knowledge graph in Neo4j

Tomaz Bratanic
Towards Data Science
9 min read · Aug 27, 2020

After a short summer break, I have prepared a new blog series. In this first part, we will construct a knowledge graph of monuments located in Spain. As you might know, I have lately gained a lot of interest in, and respect for, the wealth of knowledge available through the WikiData API. We will continue honing our SPARQL skills and fetch information about the monuments located in Spain from the WikiData API. I wasn’t aware of this before, but importing RDF data available online into Neo4j is such a popular topic that Dr. Jesus Barrasa developed the Neosemantics library to help us with this process.

In the next parts of the series, we will take a look at the pathfinding algorithms available in the Neo4j Graph Data Science library.

Agenda

  1. Install Neosemantics library
  2. Graph model
  3. Construct WikiData SPARQL query
  4. Import RDF Graph
  5. Reverse Geocode with OSM API
  6. Verify Data

Install Neosemantics library

In this blog series, we will use the standard APOC and GDS libraries, which we can install with a single click in the Neo4j Desktop application. On top of that, we will add the Neosemantics library to our stack. It is used to interact with RDF data in the Neo4j environment. We can either import RDF data into Neo4j or export the property graph model in RDF format.

To install the Neosemantics library, we download the latest release and save it to the Neo4j plugins folder. We also need to add the following line to the Neo4j configuration file.

dbms.unmanaged_extension_classes=n10s.endpoint=/rdf

We are now ready to start our Neo4j instance. First, we need to initiate the Neosemantics configuration with the following Cypher procedure.

CALL n10s.graphconfig.init({handleVocabUris: "IGNORE"})

Take a look at the documentation for information about the configuration options.

Graph model

Monuments are at the center of our graph. We store their name and the URL of their image as node properties. The monuments have been influenced by various architectural styles, which we indicate with a relationship to an Architecture node. We will save the city and the state of each monument as a two-level hierarchical location tree.

Graph model created with draw.io

The Neosemantics library requires a unique constraint on the property “uri” of the nodes labeled Resource. We will also add indexes for the State and City nodes. The apoc.schema.assert procedure allows us to define many indexes and unique constraints with a single call.

CALL apoc.schema.assert(
{State:['id'], City:['id']},
{Resource:['uri']})

Construct WikiData SPARQL query

For me, the easiest way to construct a new SPARQL query is using the WikiData query editor. It has a lovely autocomplete feature. It also helps with query debugging.

We want to retrieve all the instances of monuments located in Spain. I have found that the easiest way to find various entities on WikiData is simply to use Google search. You can then inspect all the available properties of the entity on the website. The SPARQL query I first came up with looks like this:

SELECT * 
WHERE { ?item wdt:P31 wd:Q4989906 .
?item wdt:P17 wd:Q29 .
?item rdfs:label ?monumentName .
filter(lang(?monumentName) = "en")
?item wdt:P625 ?location .
?item wdt:P149 ?architecture .
?architecture rdfs:label ?architectureName .
filter(lang(?architectureName) = "en")
?item wdt:P18 ?image }

The first two lines in the WHERE clause define the entities we are looking for:

// Entity is an instance of monument entity
?item wdt:P31 wd:Q4989906 .
// Entity is located in Spain
?item wdt:P17 wd:Q29 .

Next, we also determine which properties of the entities we are interested in. In our case, we would like to retrieve the monument’s name, image, location, and architectural style. If we run this query in the query editor, we get the following results.
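If you want to test the request outside the editor, the URL that the Neosemantics fetch procedures download later in this post is just the endpoint plus the URL-encoded query; a minimal Python sketch of that assembly (urllib's `quote` stands in for `apoc.text.urlencode`; the shortened query here is illustrative):

```python
from urllib.parse import quote

# Shortened version of the article's query: monuments (Q4989906) in Spain (Q29)
sparql = """SELECT * WHERE {
  ?item wdt:P31 wd:Q4989906 .
  ?item wdt:P17 wd:Q29 .
}"""

# Same concatenation the later Cypher calls use with apoc.text.urlencode
url = "https://query.wikidata.org/sparql?query=" + quote(sparql, safe="")
print(url[:80])
```

Issuing an HTTP GET against this URL with an `Accept: application/ld+json` header is essentially what the n10s fetch procedures do for us.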

We have defined the information we would like to retrieve in the WHERE clause of the SPARQL query. We need to massage the data format a bit before we can import it with Neosemantics. The first and most crucial thing is to change the SELECT clause to CONSTRUCT. This way, we get back RDF triples instead of a table of information. You can read more about the difference between SELECT and CONSTRUCT in this blog post written by Mark Needham.

With the Neosemantics library, we can preview what our stored graph model would look like with the n10s.rdf.preview.fetch procedure. We will start by inspecting the graph schema of an empty CONSTRUCT statement.

WITH '
CONSTRUCT
WHERE { ?item wdt:P31 wd:Q4989906 .
?item wdt:P17 wd:Q29 .
?item rdfs:label ?monumentName .
?item wdt:P625 ?location .
?item wdt:P149 ?architecture .
?architecture rdfs:label
?architectureName .
?item wdt:P18 ?image} limit 10 ' AS sparql
CALL n10s.rdf.preview.fetch(
"https://query.wikidata.org/sparql?query=" +
apoc.text.urlencode(sparql),"JSON-LD",
{ headerParams: { Accept: "application/ld+json"} ,
handleVocabUris: "IGNORE"})
YIELD nodes, relationships
RETURN nodes, relationships

Results

One problem is that the nodes have no labels. You might also notice that the relationship types are not very informative, as P149 or P31 do not mean much if you don’t know the WikiData property mapping. Another thing, which is not obvious from this visualization, is that the URL of the image is stored as a separate node. If you remember the graph model from before, we decided to save the image URL as a property of the monument node.

I won’t go into much detail, but inside the CONSTRUCT clause, we can define what our graph schema should look like in Neo4j. We have also defined that we want to save the URL of the image as a property of the monument node instead of as a separate node, with the following syntax:

?item wdt:P18 ?image . 
bind(str(?image) as ?imageAsStr)

We can now preview the new query with the updated CONSTRUCT statement.

WITH ' PREFIX sch: <http://schema.org/> 
CONSTRUCT{ ?item a sch:Monument;
sch:name ?monumentName;
sch:location ?location;
sch:img ?imageAsStr;
sch:ARCHITECTURE ?architecture.
?architecture a sch:Architecture;
sch:name ?architectureName }
WHERE { ?item wdt:P31 wd:Q4989906 .
?item wdt:P17 wd:Q29 .
?item rdfs:label ?monumentName .
filter(lang(?monumentName) = "en")
?item wdt:P625 ?location .
?item wdt:P149 ?architecture .
?architecture rdfs:label ?architectureName .
filter(lang(?architectureName) = "en")
?item wdt:P18 ?image .
bind(str(?image) as ?imageAsStr) } limit 100 ' AS sparql
CALL n10s.rdf.preview.fetch(
"https://query.wikidata.org/sparql?query=" +
apoc.text.urlencode(sparql),"JSON-LD",
{ headerParams: { Accept: "application/ld+json"} ,
handleVocabUris: "IGNORE"})
YIELD nodes, relationships
RETURN nodes, relationships

Results

Import RDF graph

Now we can go ahead and import the graph into Neo4j. Instead of the n10s.rdf.preview.fetch procedure, we use n10s.rdf.import.fetch and keep the rest of the query identical.

WITH 'PREFIX sch: <http://schema.org/> 
CONSTRUCT{ ?item a sch:Monument;
sch:name ?monumentName;
sch:location ?location;
sch:img ?imageAsStr;
sch:ARCHITECTURE ?architecture.
?architecture a sch:Architecture;
sch:name ?architectureName }
WHERE { ?item wdt:P31 wd:Q4989906 .
?item wdt:P17 wd:Q29 .
?item rdfs:label ?monumentName .
filter(lang(?monumentName) = "en")
?item wdt:P625 ?location .
?item wdt:P149 ?architecture .
?architecture rdfs:label ?architectureName .
filter(lang(?architectureName) = "en")
?item wdt:P18 ?image .
bind(str(?image) as ?imageAsStr) }' AS sparql
CALL n10s.rdf.import.fetch(
"https://query.wikidata.org/sparql?query=" +
apoc.text.urlencode(sparql),"JSON-LD",
{ headerParams: { Accept: "application/ld+json"} ,
handleVocabUris: "IGNORE"})
YIELD terminationStatus, triplesLoaded
RETURN terminationStatus, triplesLoaded

Let’s start with some exploratory graph queries. We will first count the number of monuments in our graph.

MATCH (n:Monument) 
RETURN count(*)

We have imported 1401 monuments into our graph. We will continue by counting the number of monuments grouped by architectural style.

MATCH (n:Architecture) 
RETURN n.name as architecture,
size((n)<--()) as count
ORDER BY count DESC
LIMIT 5

Results

Romanesque and Gothic architectural styles influence the most monuments. While exploring WikiData, I noticed that an architectural style can be a subclass of other architectural styles. As an exercise, we will import the architectural hierarchy relationships into our graph. In our query, we will iterate over all the architectural styles stored in our graph, fetch any parent architectural style from WikiData, and save it back to our graph.

MATCH (a:Architecture) 
WITH ' PREFIX sch: <http://schema.org/>
CONSTRUCT { ?item a sch:Architecture;
sch:SUBCLASS_OF ?style.
?style a sch:Architecture;
sch:name ?styleName }
WHERE { filter (?item = <' + a.uri + '>)
?item wdt:P279 ?style .
?style rdfs:label ?styleName .
filter(lang(?styleName) = "en") } ' AS sparql
CALL n10s.rdf.import.fetch(
"https://query.wikidata.org/sparql?query=" +
apoc.text.urlencode(sparql),"JSON-LD",
{ headerParams: { Accept: "application/ld+json"} ,
handleVocabUris: "IGNORE"})
YIELD terminationStatus, triplesLoaded
RETURN terminationStatus, triplesLoaded

We can now look at some examples of architectural hierarchy.

MATCH (a:Architecture)-[:SUBCLASS_OF]->(b:Architecture)
RETURN a.name as child_architecture,
b.name as parent_architecture
LIMIT 5

Results

It seems that modernism is a child category of Art Nouveau, and Art Nouveau is a child category of decorative arts.

Spatial enrichment

At first, I wanted to include the municipality information of monuments available on WikiData, but as it turned out, this information is relatively sparse. No worries though; I later realized we could use a reverse geocoding API to retrieve this information. APOC has a dedicated procedure for reverse geocoding. By default, it uses the Open Street Map API, but we can customize it to work with other providers as well. Check the documentation for more information.

First, we have to transform the location information to a spatial point data type.

MATCH (m:Monument) 
WITH m,
split(substring(m.location, 6, size(m.location) - 7)," ") as point
SET m.location_point = point(
{latitude: toFloat(point[1]),
longitude: toFloat(point[0])})
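The substring arithmetic above is easier to read outside Cypher; here is a short Python sketch of the same parsing (the sample coordinates are made up, and note that WikiData returns P625 values as WKT "Point(longitude latitude)" literals):

```python
def parse_wkt_point(location):
    """Strip the 'Point(' prefix and trailing ')', then split on the space.
    Mirrors substring(m.location, 6, size(m.location) - 7) and split()."""
    lon, lat = location[6:-1].split(" ")
    return {"latitude": float(lat), "longitude": float(lon)}

# Sample WKT literal; longitude comes first, as in WikiData
point = parse_wkt_point("Point(-3.7038 40.4168)")
print(point)  # {'latitude': 40.4168, 'longitude': -3.7038}
```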

Let’s check a sample response from the OSM reverse geocode API.

MATCH (m:Monument)
WITH m LIMIT 1
CALL apoc.spatial.reverseGeocode(
m.location_point.latitude,
m.location_point.longitude)
YIELD data
RETURN data

Results

{   
"country": "España",
"country_code": "es",
"isolated_dwelling": "La Castanya",
"historic": "Monument als caiguts en atac Carlista 1874",
"road": "Camí de Collformic a Matagalls",
"city": "el Brull",
"municipality": "Osona",
"county": "Barcelona",
"postcode": "08559",
"state": "Catalunya"
}

The Open Street Map API is a tad interesting as it distinguishes between cities, towns, and villages. Also, the monuments located in the Canaries have no state available but are part of the Canaries archipelago. We will treat the archipelago as a state and lump city, town, and village under a single City label. For batching purposes, we will use the apoc.periodic.iterate procedure.

CALL apoc.periodic.iterate(
'MATCH (m:Monument) RETURN m',
'WITH m
CALL apoc.spatial.reverseGeocode(
m.location_point.latitude,m.location_point.longitude)
YIELD data
WHERE (data.state IS NOT NULL OR
data.archipelago IS NOT NULL)
MERGE (s:State{id:coalesce(data.state, data.archipelago)})
MERGE (c:City{id:coalesce(data.city, data.town,
data.village, data.county)})
MERGE (c)-[:IS_IN]->(s)
MERGE (m)-[:IS_IN]->(c)',
{batchSize:10})
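The coalesce fallbacks in the query above amount to taking the first non-null field; a minimal Python sketch of the same logic, applied to a trimmed copy of the sample response shown earlier:

```python
def pick_city(data):
    """Mirror coalesce(data.city, data.town, data.village, data.county)."""
    for key in ("city", "town", "village", "county"):
        if data.get(key) is not None:
            return data[key]
    return None

def pick_state(data):
    """Mirror coalesce(data.state, data.archipelago)."""
    state = data.get("state")
    return state if state is not None else data.get("archipelago")

# Trimmed copy of the sample reverse geocode response
sample = {"city": "el Brull", "municipality": "Osona",
          "county": "Barcelona", "state": "Catalunya"}
print(pick_city(sample), pick_state(sample))  # el Brull Catalunya
```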

This query will take a long time because the default throttle delay setting is five seconds. If you don’t have the time to wait, I have saved the spatial results to GitHub, and you can easily import them with the following query in less than five seconds.

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/blogs/master/Traveling_tourist/traveling_tourist_cities.csv" as row
MATCH (m:Monument{uri:row.uri})
MERGE (c:City{id:row.city})
MERGE (s:State{id:row.state})
MERGE (m)-[:IS_IN]->(c)
MERGE (c)-[:IS_IN]->(s);

We will first check whether any monuments are missing spatial values.

MATCH (m:Monument) 
RETURN exists ((m)-[:IS_IN]->()) as has_location,
count(*) as count

Results

We have retrieved spatial information for almost all of the monuments. Something you need to be careful about when creating location hierarchy trees is that every node in the tree has only a single outgoing relationship to its parent. If this rule is broken, the structural integrity of the location tree is lost, as some entities will have more than a single location. Check my location trees post for more information on how to circumvent this problem.

MATCH (c:City)
WHERE size((c)-[:IS_IN]->()) > 1
RETURN c
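The same single-parent check can be expressed over any list of child/parent pairs; a quick Python sketch (the edge data here is hypothetical):

```python
from collections import Counter

def multi_parent_nodes(is_in_edges):
    """Return children with more than one outgoing IS_IN relationship,
    i.e. the nodes that would break the location tree."""
    counts = Counter(child for child, _ in is_in_edges)
    return [child for child, n in counts.items() if n > 1]

# Hypothetical IS_IN edges; "Toledo" violates the single-parent rule
edges = [("el Brull", "Catalunya"),
         ("Toledo", "Castilla-La Mancha"),
         ("Toledo", "Madrid")]
print(multi_parent_nodes(edges))  # ['Toledo']
```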

Luckily we don’t run into this problem here. We can now explore the results of our spatial enrichment. We will look at the count of monuments grouped by architectural style located in Catalunya.

MATCH (s:State{id:'Catalunya'})<-[:IS_IN*2..2]-(:Monument)-[:ARCHITECTURE]->(architecture)
RETURN architecture.name as architecture,
count(*) as count
ORDER BY count DESC
LIMIT 5

Results

Let’s quickly look at the WikiData definition of vernacular architecture for educational purposes.

Vernacular architecture is architecture characterized by the use of local materials and knowledge, usually without the supervision of professional architects.

We can also look at the most frequent architectural style of monuments by state. We will use the subquery syntax introduced in Neo4j 4.0.

MATCH (s:State)
CALL {
WITH s
MATCH (s)<-[:IS_IN*2..2]-()-[:ARCHITECTURE]->(a)
RETURN a.name as architecture, count(*) as count
ORDER BY count DESC LIMIT 1
}
RETURN s.id as state, architecture, count
ORDER BY count DESC
LIMIT 5

Results
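Per state, the subquery above computes a grouped "most common value"; a Python sketch of the same aggregation with made-up rows:

```python
from collections import Counter, defaultdict

def top_architecture_per_state(rows):
    """rows: (state, architecture) pairs, one per monument.
    Return the most frequent architecture and its count for each state."""
    per_state = defaultdict(Counter)
    for state, architecture in rows:
        per_state[state][architecture] += 1
    return {state: c.most_common(1)[0] for state, c in per_state.items()}

# Made-up rows for illustration
rows = [("Catalunya", "Romanesque"), ("Catalunya", "Romanesque"),
        ("Catalunya", "Gothic"), ("Galicia", "Baroque")]
print(top_architecture_per_state(rows))
```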

Conclusion

If you have followed the steps in this post, your graph should look something like the picture above. I have always been impressed by how easy it is to fetch data from various APIs using only Cypher. And if you want to call any custom endpoint, you can still use the apoc.load.json procedure.

In the next part, we will dig into the pathfinding algorithms. In the meantime, try Neo4j and join the Twin4j newsletter.

As always, the code is available on GitHub.
