Traveling tourist
Exploring Pathfinding Graph Algorithms
A deep dive into pathfinding algorithms available in Neo4j Graph Data Science library
In the first part of the series, we constructed a knowledge graph of monuments located in Spain from WikiData API. Now we’ll put on our graph data science goggles and explore various pathfinding algorithms available in the Neo4j Graph Data Science library. To top it off, we’ll look at a brute force solution for a Santa Claus problem. Now, you might wonder what a Santa Claus problem is. It is a variation of the traveling salesman problem, except we don’t require the solution to end in the same city as it started. This is because of the Santa Claus’ ability to bend the time-space continuum and instantly fly back to the North Pole once he’s finished with delivering goodies.
Agenda
- Infer spatial network of monuments
- Load the in-memory projected graph with cypher projection
- Weakly connected component algorithm
- Shortest path algorithm
- Yen’s k-shortest path algorithm
- Single source shortest paths algorithm
- Minimum spanning tree algorithm
- Random walk algorithm
- Traveling salesman problem
- Conclusion
Infer spatial network of monuments
Currently, we have no direct relationships between the monuments in our graph. We do, however, have their GPS locations, which allows us to identify which monuments are nearby. This way, we can infer a spatial network of monuments.
The process is very similar to inferring a similarity network. We usually don’t want to end up with a complete graph, where each node is connected to all the other ones. It would defeat the purpose of demonstrating pathfinding algorithms as the shortest path between any two nodes would always be a straight line, which would be represented as a direct relationship between the two nodes. In our case, we will connect each monument to the five closest monuments that are less than 100 kilometers away. These two numbers are entirely arbitrary. You can pick any other depending on your scenario.
MATCH (m1:Monument),(m2:Monument)
WHERE id(m1) > id(m2)
WITH m1,m2, distance(m1.location_point,m2.location_point) as distance
ORDER BY distance ASC
WHERE distance < 100000
WITH m1,collect({node:m2,distance:distance})[..5] as nearest
UNWIND nearest as near
WITH m1, near, near.node as nearest_node
MERGE (m1)-[m:NEAR]-(nearest_node) SET m.distance = near.distance
Load the in-memory projected graph with cypher projection
Let’s just quickly refresh how does the GDS library work.
The graph analytics pipeline consists of three parts. In the first part, the graph loader reads the stored graph from Neo4j and loads it as an in-memory projected graph. We can use either native projection or cypher projection to load the projected graph. In the second step, we execute the graph algorithms in sequence. We can use the results of one graph algorithm as an input to another. Last but not least, we store or stream the results back to Neo4j.
Here, we will use the cypher projection to load the in-memory graph. I suggest you take a look at the official documentation for more details regarding how it works. In the node statement, we will describe all monuments in our graph and add their architecture style as a node label. Adding a custom node label will allow us to filter nodes by architectural style at algorithm execution time. In the relationship statement, we will describe all the links between monuments and include the distance property, that we will use as a relationship weight.
CALL gds.graph.create.cypher('monuments',
'MATCH (m:Monument)-[:ARCHITECTURE]->(a)
RETURN id(m) as id, collect(a.name) as labels',
'MATCH (m1:Monument)-[r:NEAR]-(m2:Monument)
RETURN id(m1) as source, id(m2) as target, r.distance as distance')
Weakly connected component algorithm
Even though the weakly connected component algorithm is not a pathfinding algorithm, it is part of almost every graph analysis. It is used to find disconnected components or islands within our graph. We’ll begin by running the stats
mode of the algorithm.
CALL gds.wcc.stats('monuments')
YIELD componentCount, componentDistribution
Results
There are six separate components within our monuments network. The results are typical for a real-world dataset. We have a single super component that contains 98% of all nodes and a couple of tiny islands floating around. Let’s examine the smaller components.
CALL gds.wcc.stream('monuments')
YIELD nodeId, componentId
WITH componentId, gds.util.asNode(nodeId) as node
OPTIONAL MATCH (node)-[:IS_IN*2..2]->(state)
RETURN componentId,
count(*) as component_size,
collect(node.name) as monuments,
collect(distinct state.id) as state
ORDER BY component_size DESC
SKIP 1
Results
Three of the five smaller components are located in the Canaries archipelago, and one is located in the Balearic Islands, specifically on Majorca. With the Neomap application, developed by Estelle Scifo, we can visualize the Canaries archipelago components on a map.
One component spans over two monuments on Fuerteventura and Lanzarote. The second one consists of a couple of monuments located on Tenerife and Gran Canaria. On the left, there is a single monument on El Hierro Island. They are separate components because there is no link between them. The absence of a connection between the components implies that there are more than 100 kilometers away because that is the threshold we chose when we inferred the spatial network.
P.s. If you like any water activities, I highly recommend visiting the Canaries.
Shortest Path algorithm
The first pathfinding graph algorithm we will use is the Shortest Path algorithm. It finds the shortest weighted path between two nodes. We define the start node and the end node and specify which relationship weight property should the algorithm take into consideration when calculating the shortest path.
MATCH (s:Monument{name:'Iglesia de Santo Domingo'}),
(e:Monument{name:'Colegiata de Santa María de Piasca'})
CALL gds.alpha.shortestPath.stream('monuments',{
startNode:s,
endNode:e,
relationshipWeightProperty:'distance'})
YIELD nodeId, cost
RETURN gds.util.asNode(nodeId).name as monument, cost
Results
The cost is expressed as the distance in meters. We can visualize the shortest path with a slightly modified version of Neomap. I have customized the popup of the monuments to include its image and the architectural style.
You might observe that we skip the Santa Cruz de Cangas de Onís monument, which is located in the middle right of the image. A slight detour will result in a longer path than just traversing in a straight line from Iglesia de San Emeterio to Santo Toribio de Liébana.
What if we wanted to plan a trip for an architectural class and visit only monuments that were influenced by either Gothic or Romanesque architecture along the way? Planning such a trip is very easy with the GDS library, as we can filter which nodes can the algorithm visit with the nodeLabels
parameter.
MATCH (s:Monument{name:'Iglesia de Santo Domingo'}),
(t:Monument{name:'Colegiata de Santa María de Piasca'})
CALL gds.alpha.shortestPath.stream('monuments',{
startNode:s,
endNode:t,
relationshipWeightProperty:'distance',
nodeLabels:['Gothic architecture','Romanesque architecture']})
YIELD nodeId, cost
RETURN gds.util.asNode(nodeId).name as monument, cost
Results
The route is a bit different this time as the algorithm can only visit monuments that were influenced by Gothic or Romanesque architecture style.
Yen’s k-shortest path algorithm
We have learned how to calculate the shortest weighted path between a pair of nodes. What if we were more cautious tourists and wanted to find the top three shortest paths? Having a backup plan if something unexpected might happen along the way is always a good idea. In this scenario, we could use the Yen’s k-shortest path algorithm. The syntax is almost identical to the Shortest Path algorithm, except for the added k
parameter, which defines how many shortest paths we would like to find.
MATCH (s:Monument{name:'Iglesia de Santo Domingo'}),
(t:Monument{name:'Colegiata de Santa María de Piasca'})
CALL gds.alpha.kShortestPaths.stream('monuments',{
startNode:s,
endNode:t,
k:3,
relationshipWeightProperty:'distance'})
YIELD index,nodeIds,costs
RETURN index,[nodeId in nodeIds | gds.util.asNode(nodeId).name] as monuments,apoc.coll.sum(costs) as total_cost
Results
The three paths are almost the same length, just a couple hundred meters of difference. If you look closely, only the second stop is different among the three variants. Such a small difference can be attributed to the nature of our spatial network and the example pair of nodes.
Single source shortest path algorithm
With the Single Source Shortest Path algorithm, we define the start node and search for the shortest weighted path to all the other nodes in the network. We’ll inspect one of the Canaries components to limit the number of shortest paths to a reasonable number.
We’ll examine the Tenerife — Gran Canaria component and choose the Cathedral of La Laguna as the starting node. The algorithm tries to find the shortest paths to all the other nodes in the network, and if no such way exists, it returns Infinity value as a result. We will filter out the unreachable nodes with the gds.util.isFinite
procedure.
MATCH (start:Monument{name:’Cathedral of La Laguna’})
CALL gds.alpha.shortestPaths.stream(‘monuments’,
{startNode:start, relationshipWeightProperty:’distance’})
YIELD nodeId, distance
WHERE gds.util.isFinite(distance) = True
RETURN gds.util.asNode(nodeId).name as monument,distance
ORDER BY distance ASC
Results
The closest monument to the Cathedral of La Laguna is the Iglesia de la Concepción, which is just 420 meters away. It seems that there are two Iglesia de la Concepción on Tenerife Island as we can observe that it shows up twice in our results. The farthest reachable monument in our network from the Cathedral of La Laguna is Basilica of San Juan Bautista.
If we wanted to find the cost of the shortest path to all the reachable neoclassical monuments from the Cathedral of La Laguna, we could effortlessly achieve this with the nodeLabels
parameter.
MATCH (start:Monument{name:'Cathedral of La Laguna'})
CALL gds.alpha.shortestPaths.stream('monuments',
{startNode:start, relationshipWeightProperty:'distance',
nodeLabels:['Neoclassical architecture']})
YIELD nodeId, distance
WHERE gds.util.isFinite(distance) = True
RETURN gds.util.asNode(nodeId).name as monument,
distance
ORDER BY distance ASC
Results
It seems there are only four neoclassical monuments on Tenerife and Gran Canaria islands.
Minimum weight spanning tree algorithm
The Minimum Weight Spanning Tree algorithm starts from a given node and calculates a spanning tree connecting all reachable nodes with the minimum possible sum of relationship weights. For example, if we wanted to connect all the monuments in Tenerife and Gran Canaria with an optical or electric cable, we would use the Minimum Weight Spanning Tree algorithm.
MATCH (start:Monument{name:’Cathedral of La Laguna’})
CALL gds.alpha.spanningTree.minimum.write(‘monuments’,{
startNodeId:id(start),
relationshipWeightProperty:’distance’,
weightWriteProperty:’cost’})
YIELD effectiveNodeCount
RETURN effectiveNodeCount
Results
Currently, only the write
mode of the algorithm is available. We can visualize our potential cable route with Neomap.
Random walk algorithm
We can imagine the Random Walk algorithm trying to mimic a drunk crowd traversing the network. They might go left, or right, take two steps forward, one step back, we never really know. It depends on how drunk the crowd is. We can use this algorithm to provide random trip recommendations. Imagine we have just visited the University of Barcelona historical building and are not sure which monuments we should take a look at next.
MATCH (m:Monument{name:"University of Barcelona historical building"})
CALL gds.alpha.randomWalk.stream('monuments',
{start:id(m), walks:3, steps:5})
YIELD nodeIds
RETURN [nodeId in nodeIds | gds.util.asNode(nodeId).name] as result
Results
Remember, we mentioned that the Random Walk algorithm tries to mimic a drunk person traversing the network. Well, an intoxicated person might visit the same monument twice and not care. For example, in the first and third suggestions, a single monument shows up twice. Luckily, we have some options to influence how the algorithm should traverse the network in the node2vec mode with the following two parameters:
return: This parameter controls the likelihood of immediately revisiting a node in a walk. Setting it to a high value (> max(inOut, 1)) ensures that we are less likely to sample an already visited node in the following two steps.
inOut: This parameter allows the search to differentiate between “inward” and “outward” nodes. If inOut > 1, the random walk is biased towards nodes close to node t. In contrast, if inOut < 1, the walk is more inclined to visit nodes that are further away from the node t.
The definition of the two parameters is summarized from the original Node2vec paper.
We want to recommend monuments close to our starting point, so we choose the inOut
parameter to be greater than 1. And we definitely would like to avoid revisiting an already visited node during the walk, so we choose the return
parameter to be greater than the inOut
parameter.
MATCH (m:Monument{name:"University of Barcelona historical building"})
CALL gds.alpha.randomWalk.stream('monuments',
{start:id(m), walks:3, steps:5,
mode:'node2vec', inOut:5, return:10})
YIELD nodeIds
RETURN [nodeId in nodeIds | gds.util.asNode(nodeId).name] as result
Results
Unfortunately, the return
parameter ensures that we are less likely to sample an already visited node in the following two steps. This means that we can’t be sure that duplicates won’t show up later during our walk. In our example, Casa Batlló appears twice in the first suggestion. We can circumnavigate this problem by creating longer walk suggestions and filtering out duplicates before showing the results to the user. In the following query, we calculate nine steps long walks, filter out duplicates, and return only the first five results.
MATCH (m:Monument{name:"University of Barcelona historical building"})
CALL gds.alpha.randomWalk.stream('monuments',
{start:id(m), walks:3, steps:9,
mode:'node2vec', inOut:5, return:10})
YIELD nodeIds
RETURN apoc.coll.toSet([nodeId in nodeIds | gds.util.asNode(nodeId).name])[..5] as result
Results
This way, we make sure the results never contain duplicates. Now we can visualize the results with our trip recommendation application.
Traveling salesman problem
To top it off, we will solve the Santa Claus variation of the traveling salesman problem. As mentioned, the only difference is that we omit the requirement to end up in the same location as we started. I found the inspiration for this problem in the Gaming the Christmas Market post written by David Barton. I give all the credits to David Barton for conjuring up the solution. My contribution is to update it to work with Neo4j 4.0 and the GDS library.
Say we want to find the optimal route between this monuments:
:param selection => ["Castell de Santa Pau","Castell de Sant Jaume","Castell de Vilaüt","Castell de Sarraí","Castell de Solius","Portal d'Albanyà","Castell de Sant Gregori","Casa Frigola"]
We split the solution into two steps. First, we calculate the shortest path between all pairs of selected monuments with the gds.alpha.shortestPath
algorithm and store the results as the SHORTEST_ROUTE_TO relationship between the given pair of nodes. We save the total cost and all the intermediate nodes along the shortest path as the properties of the SHORTEST_ROUTE_TO relationship.
WITH $selection as selection
MATCH (c:Monument)
WHERE c.name in selection
WITH collect(c) as monuments
UNWIND monuments as c1
WITH c1,
[c in monuments where c.name > c1.name | c] as c2s,
monuments
UNWIND c2s as c2
CALL gds.alpha.shortestPath.stream('monuments',{startNode:c1,endNode:c2,relationshipWeightProperty:'distance'})
YIELD nodeId, cost
WITH c1,
c2,
max(cost) as totalCost,
collect(nodeId) as shortestHopNodeIds
MERGE (c1) -[r:SHORTEST_ROUTE_TO]- (c2)
SET r.cost = totalCost,
r.shortestHopNodeIds = shortestHopNodeIds
After completing the first step, we have created a complete graph of SHORTEST_ROUTE_TO relationships between the selected monuments.
In the second step, we will use the apoc.path.expandConfig
procedure. It enables us to perform variable-length path traversals with fine-grained control over the traversals. Check out the documentation for more details.
We allow the procedure to traverse only SHORTEST_ROUTE_TO relationships with the relationshipFilter
parameter and visit only the selected monuments with the whitelistNodes
parameter. We ensure that all selected nodes must be visited exactly once by defining the number of hops or levels traversed (minLevel
and maxLevel
) and with the uniqueness
parameter. I know it is a lot to comprehend, and if you need some help, I would suggest asking questions on the Neo4j community site. We then select the path with the minimum sum of relationship weights as the solution. Because we calculate all the possible routes between the chosen monuments, this is a brute-force solution of the traveling salesman problem.
WITH $selection as selection
MATCH (c:Monument)
WHERE c.name in selection
WITH collect(c) as monuments
UNWIND monuments as c1
WITH c1,
[c in monuments where c.name > c1.name | c] as c2s,
monuments,
(size(monuments) - 1) as level
UNWIND c2s as c2
CALL apoc.path.expandConfig(c1, {
relationshipFilter: 'SHORTEST_ROUTE_TO',
minLevel: level,
maxLevel: level,
whitelistNodes: monuments,
terminatorNodes: [c2],
uniqueness: 'NODE_PATH'})
YIELD path
WITH path,
reduce(cost = 0, x in relationships(path) | cost + x.cost) as totalCost
ORDER BY totalCost LIMIT 1
WITH path,
totalCost,
apoc.coll.flatten([r in relationships(path) | r.shortestHopNodeIds]) as intermediate_stops,
[n in nodes(path) | id(n)] as node_ids
RETURN [n in nodes(path) | n.name] as path,
round(totalCost) as total_distance,
[optional in intermediate_stops where not optional in node_ids | gds.util.asNode(optional).name] as optional_stops
Results
In the path column of the results, we have an ordered array of selected monuments to visit. Our travel would start with Castell de Sant Jaume and continue to Castell de Vilaüt and so on. We could dub this the Spanish castle-visiting trip as we selected six castles, and we have an option to see four more along the way. The total air distance of the path is 126 kilometers. Let’s visualize the results with our trip recommendation application.
Red markers are the selected monuments, and the blue markers are the optional stops along the way.
Conclusion
We have demonstrated most of the pathfinding algorithm available in the Neo4j Graph Data Science library with some real world use cases. The only puzzle left in this series is to finish the trip recommendation application. I have a plan to show off the application in the part three of the series. Till then, I encourage you to play around with various GDS library algorithm or try to recreate this series on a Neo4j sandbox instance. If you have any further questions, there are a bunch of Neo4j experts ready to help you on Neo4j community site.
As always, the code is available on GitHub.