Traveling tourist

Exploring Pathfinding Graph Algorithms

A deep dive into pathfinding algorithms available in Neo4j Graph Data Science library

Tomaz Bratanic
Towards Data Science
13 min readAug 30, 2020

--

In the first part of the series, we constructed a knowledge graph of monuments located in Spain from WikiData API. Now we’ll put on our graph data science goggles and explore various pathfinding algorithms available in the Neo4j Graph Data Science library. To top it off, we’ll look at a brute force solution for a Santa Claus problem. Now, you might wonder what a Santa Claus problem is. It is a variation of the traveling salesman problem, except we don’t require the solution to end in the same city as it started. This is because of the Santa Claus’ ability to bend the time-space continuum and instantly fly back to the North Pole once he’s finished with delivering goodies.

Agenda

  1. Infer spatial network of monuments
  2. Load the in-memory projected graph with cypher projection
  3. Weakly connected component algorithm
  4. Shortest path algorithm
  5. Yen’s k-shortest path algorithm
  6. Single source shortest paths algorithm
  7. Minimum spanning tree algorithm
  8. Random walk algorithm
  9. Traveling salesman problem
  10. Conclusion

Infer spatial network of monuments

Currently, we have no direct relationships between the monuments in our graph. We do, however, have their GPS locations, which allows us to identify which monuments are nearby. This way, we can infer a spatial network of monuments.

The process is very similar to inferring a similarity network. We usually don’t want to end up with a complete graph, where each node is connected to all the other ones. It would defeat the purpose of demonstrating pathfinding algorithms as the shortest path between any two nodes would always be a straight line, which would be represented as a direct relationship between the two nodes. In our case, we will connect each monument to the five closest monuments that are less than 100 kilometers away. These two numbers are entirely arbitrary. You can pick any other depending on your scenario.

MATCH (m1:Monument),(m2:Monument) 
WHERE id(m1) > id(m2)
WITH m1,m2, distance(m1.location_point,m2.location_point) as distance
ORDER BY distance ASC
WHERE distance < 100000
WITH m1,collect({node:m2,distance:distance})[..5] as nearest
UNWIND nearest as near
WITH m1, near, near.node as nearest_node
MERGE (m1)-[m:NEAR]-(nearest_node) SET m.distance = near.distance

Load the in-memory projected graph with cypher projection

Let’s just quickly refresh how does the GDS library work.

The image is borrowed from the official documentation.

The graph analytics pipeline consists of three parts. In the first part, the graph loader reads the stored graph from Neo4j and loads it as an in-memory projected graph. We can use either native projection or cypher projection to load the projected graph. In the second step, we execute the graph algorithms in sequence. We can use the results of one graph algorithm as an input to another. Last but not least, we store or stream the results back to Neo4j.

Here, we will use the cypher projection to load the in-memory graph. I suggest you take a look at the official documentation for more details regarding how it works. In the node statement, we will describe all monuments in our graph and add their architecture style as a node label. Adding a custom node label will allow us to filter nodes by architectural style at algorithm execution time. In the relationship statement, we will describe all the links between monuments and include the distance property, that we will use as a relationship weight.

CALL gds.graph.create.cypher('monuments',
'MATCH (m:Monument)-[:ARCHITECTURE]->(a)
RETURN id(m) as id, collect(a.name) as labels',
'MATCH (m1:Monument)-[r:NEAR]-(m2:Monument)
RETURN id(m1) as source, id(m2) as target, r.distance as distance')

Weakly connected component algorithm

Even though the weakly connected component algorithm is not a pathfinding algorithm, it is part of almost every graph analysis. It is used to find disconnected components or islands within our graph. We’ll begin by running the stats mode of the algorithm.

CALL gds.wcc.stats('monuments') 
YIELD componentCount, componentDistribution

Results

Weakly connected component stats results

There are six separate components within our monuments network. The results are typical for a real-world dataset. We have a single super component that contains 98% of all nodes and a couple of tiny islands floating around. Let’s examine the smaller components.

CALL gds.wcc.stream('monuments')
YIELD nodeId, componentId
WITH componentId, gds.util.asNode(nodeId) as node
OPTIONAL MATCH (node)-[:IS_IN*2..2]->(state)
RETURN componentId,
count(*) as component_size,
collect(node.name) as monuments,
collect(distinct state.id) as state
ORDER BY component_size DESC
SKIP 1

Results

Smaller weakly connected components members

Three of the five smaller components are located in the Canaries archipelago, and one is located in the Balearic Islands, specifically on Majorca. With the Neomap application, developed by Estelle Scifo, we can visualize the Canaries archipelago components on a map.

One component spans over two monuments on Fuerteventura and Lanzarote. The second one consists of a couple of monuments located on Tenerife and Gran Canaria. On the left, there is a single monument on El Hierro Island. They are separate components because there is no link between them. The absence of a connection between the components implies that there are more than 100 kilometers away because that is the threshold we chose when we inferred the spatial network.

P.s. If you like any water activities, I highly recommend visiting the Canaries.

Shortest Path algorithm

The first pathfinding graph algorithm we will use is the Shortest Path algorithm. It finds the shortest weighted path between two nodes. We define the start node and the end node and specify which relationship weight property should the algorithm take into consideration when calculating the shortest path.

MATCH (s:Monument{name:'Iglesia de Santo Domingo'}),  
(e:Monument{name:'Colegiata de Santa María de Piasca'})
CALL gds.alpha.shortestPath.stream('monuments',{
startNode:s,
endNode:e,
relationshipWeightProperty:'distance'})
YIELD nodeId, cost
RETURN gds.util.asNode(nodeId).name as monument, cost

Results

The cost is expressed as the distance in meters. We can visualize the shortest path with a slightly modified version of Neomap. I have customized the popup of the monuments to include its image and the architectural style.

You might observe that we skip the Santa Cruz de Cangas de Onís monument, which is located in the middle right of the image. A slight detour will result in a longer path than just traversing in a straight line from Iglesia de San Emeterio to Santo Toribio de Liébana.

What if we wanted to plan a trip for an architectural class and visit only monuments that were influenced by either Gothic or Romanesque architecture along the way? Planning such a trip is very easy with the GDS library, as we can filter which nodes can the algorithm visit with the nodeLabels parameter.

MATCH (s:Monument{name:'Iglesia de Santo Domingo'}),  
(t:Monument{name:'Colegiata de Santa María de Piasca'})
CALL gds.alpha.shortestPath.stream('monuments',{
startNode:s,
endNode:t,
relationshipWeightProperty:'distance',
nodeLabels:['Gothic architecture','Romanesque architecture']})
YIELD nodeId, cost
RETURN gds.util.asNode(nodeId).name as monument, cost

Results

The route is a bit different this time as the algorithm can only visit monuments that were influenced by Gothic or Romanesque architecture style.

Yen’s k-shortest path algorithm

We have learned how to calculate the shortest weighted path between a pair of nodes. What if we were more cautious tourists and wanted to find the top three shortest paths? Having a backup plan if something unexpected might happen along the way is always a good idea. In this scenario, we could use the Yen’s k-shortest path algorithm. The syntax is almost identical to the Shortest Path algorithm, except for the added k parameter, which defines how many shortest paths we would like to find.

MATCH (s:Monument{name:'Iglesia de Santo Domingo'}),
(t:Monument{name:'Colegiata de Santa María de Piasca'})
CALL gds.alpha.kShortestPaths.stream('monuments',{
startNode:s,
endNode:t,
k:3,
relationshipWeightProperty:'distance'})
YIELD index,nodeIds,costs
RETURN index,[nodeId in nodeIds | gds.util.asNode(nodeId).name] as monuments,apoc.coll.sum(costs) as total_cost

Results

Yen’s k-shortest path algorithm results

The three paths are almost the same length, just a couple hundred meters of difference. If you look closely, only the second stop is different among the three variants. Such a small difference can be attributed to the nature of our spatial network and the example pair of nodes.

Single source shortest path algorithm

With the Single Source Shortest Path algorithm, we define the start node and search for the shortest weighted path to all the other nodes in the network. We’ll inspect one of the Canaries components to limit the number of shortest paths to a reasonable number.

We’ll examine the Tenerife — Gran Canaria component and choose the Cathedral of La Laguna as the starting node. The algorithm tries to find the shortest paths to all the other nodes in the network, and if no such way exists, it returns Infinity value as a result. We will filter out the unreachable nodes with the gds.util.isFinite procedure.

MATCH (start:Monument{name:’Cathedral of La Laguna’})
CALL gds.alpha.shortestPaths.stream(‘monuments’,
{startNode:start, relationshipWeightProperty:’distance’})
YIELD nodeId, distance
WHERE gds.util.isFinite(distance) = True
RETURN gds.util.asNode(nodeId).name as monument,distance
ORDER BY distance ASC

Results

The closest monument to the Cathedral of La Laguna is the Iglesia de la Concepción, which is just 420 meters away. It seems that there are two Iglesia de la Concepción on Tenerife Island as we can observe that it shows up twice in our results. The farthest reachable monument in our network from the Cathedral of La Laguna is Basilica of San Juan Bautista.

If we wanted to find the cost of the shortest path to all the reachable neoclassical monuments from the Cathedral of La Laguna, we could effortlessly achieve this with the nodeLabels parameter.

MATCH (start:Monument{name:'Cathedral of La Laguna'})
CALL gds.alpha.shortestPaths.stream('monuments',
{startNode:start, relationshipWeightProperty:'distance',
nodeLabels:['Neoclassical architecture']})
YIELD nodeId, distance
WHERE gds.util.isFinite(distance) = True
RETURN gds.util.asNode(nodeId).name as monument,
distance
ORDER BY distance ASC

Results

It seems there are only four neoclassical monuments on Tenerife and Gran Canaria islands.

Minimum weight spanning tree algorithm

The Minimum Weight Spanning Tree algorithm starts from a given node and calculates a spanning tree connecting all reachable nodes with the minimum possible sum of relationship weights. For example, if we wanted to connect all the monuments in Tenerife and Gran Canaria with an optical or electric cable, we would use the Minimum Weight Spanning Tree algorithm.

MATCH (start:Monument{name:’Cathedral of La Laguna’})
CALL gds.alpha.spanningTree.minimum.write(‘monuments’,{
startNodeId:id(start),
relationshipWeightProperty:’distance’,
weightWriteProperty:’cost’})
YIELD effectiveNodeCount
RETURN effectiveNodeCount

Results

Minimum Weight Spanning Tree algorithm results visualized in Neo4j Browser

Currently, only the write mode of the algorithm is available. We can visualize our potential cable route with Neomap.

Random walk algorithm

We can imagine the Random Walk algorithm trying to mimic a drunk crowd traversing the network. They might go left, or right, take two steps forward, one step back, we never really know. It depends on how drunk the crowd is. We can use this algorithm to provide random trip recommendations. Imagine we have just visited the University of Barcelona historical building and are not sure which monuments we should take a look at next.

MATCH (m:Monument{name:"University of Barcelona historical building"})
CALL gds.alpha.randomWalk.stream('monuments',
{start:id(m), walks:3, steps:5})
YIELD nodeIds
RETURN [nodeId in nodeIds | gds.util.asNode(nodeId).name] as result

Results

Random walk algorithm results

Remember, we mentioned that the Random Walk algorithm tries to mimic a drunk person traversing the network. Well, an intoxicated person might visit the same monument twice and not care. For example, in the first and third suggestions, a single monument shows up twice. Luckily, we have some options to influence how the algorithm should traverse the network in the node2vec mode with the following two parameters:

return: This parameter controls the likelihood of immediately revisiting a node in a walk. Setting it to a high value (> max(inOut, 1)) ensures that we are less likely to sample an already visited node in the following two steps.

inOut: This parameter allows the search to differentiate between “inward” and “outward” nodes. If inOut > 1, the random walk is biased towards nodes close to node t. In contrast, if inOut < 1, the walk is more inclined to visit nodes that are further away from the node t.

The definition of the two parameters is summarized from the original Node2vec paper.

We want to recommend monuments close to our starting point, so we choose the inOut parameter to be greater than 1. And we definitely would like to avoid revisiting an already visited node during the walk, so we choose the return parameter to be greater than the inOut parameter.

MATCH (m:Monument{name:"University of Barcelona historical building"})
CALL gds.alpha.randomWalk.stream('monuments',
{start:id(m), walks:3, steps:5,
mode:'node2vec', inOut:5, return:10})
YIELD nodeIds
RETURN [nodeId in nodeIds | gds.util.asNode(nodeId).name] as result

Results

Random Walk algorithm in node2vec mode

Unfortunately, the return parameter ensures that we are less likely to sample an already visited node in the following two steps. This means that we can’t be sure that duplicates won’t show up later during our walk. In our example, Casa Batlló appears twice in the first suggestion. We can circumnavigate this problem by creating longer walk suggestions and filtering out duplicates before showing the results to the user. In the following query, we calculate nine steps long walks, filter out duplicates, and return only the first five results.

MATCH (m:Monument{name:"University of Barcelona historical building"})
CALL gds.alpha.randomWalk.stream('monuments',
{start:id(m), walks:3, steps:9,
mode:'node2vec', inOut:5, return:10})
YIELD nodeIds
RETURN apoc.coll.toSet([nodeId in nodeIds | gds.util.asNode(nodeId).name])[..5] as result

Results

Random Walk algorithm results with removed duplicates

This way, we make sure the results never contain duplicates. Now we can visualize the results with our trip recommendation application.

Traveling salesman problem

To top it off, we will solve the Santa Claus variation of the traveling salesman problem. As mentioned, the only difference is that we omit the requirement to end up in the same location as we started. I found the inspiration for this problem in the Gaming the Christmas Market post written by David Barton. I give all the credits to David Barton for conjuring up the solution. My contribution is to update it to work with Neo4j 4.0 and the GDS library.

Say we want to find the optimal route between this monuments:

:param selection => ["Castell de Santa Pau","Castell de Sant Jaume","Castell de Vilaüt","Castell de Sarraí","Castell de Solius","Portal d'Albanyà","Castell de Sant Gregori","Casa Frigola"]

We split the solution into two steps. First, we calculate the shortest path between all pairs of selected monuments with the gds.alpha.shortestPath algorithm and store the results as the SHORTEST_ROUTE_TO relationship between the given pair of nodes. We save the total cost and all the intermediate nodes along the shortest path as the properties of the SHORTEST_ROUTE_TO relationship.

WITH $selection as selection
MATCH (c:Monument)
WHERE c.name in selection
WITH collect(c) as monuments
UNWIND monuments as c1
WITH c1,
[c in monuments where c.name > c1.name | c] as c2s,
monuments
UNWIND c2s as c2
CALL gds.alpha.shortestPath.stream('monuments',{startNode:c1,endNode:c2,relationshipWeightProperty:'distance'})
YIELD nodeId, cost
WITH c1,
c2,
max(cost) as totalCost,
collect(nodeId) as shortestHopNodeIds
MERGE (c1) -[r:SHORTEST_ROUTE_TO]- (c2)
SET r.cost = totalCost,
r.shortestHopNodeIds = shortestHopNodeIds

After completing the first step, we have created a complete graph of SHORTEST_ROUTE_TO relationships between the selected monuments.

Traveling salesman problem step 1

In the second step, we will use the apoc.path.expandConfig procedure. It enables us to perform variable-length path traversals with fine-grained control over the traversals. Check out the documentation for more details.

We allow the procedure to traverse only SHORTEST_ROUTE_TO relationships with the relationshipFilter parameter and visit only the selected monuments with the whitelistNodes parameter. We ensure that all selected nodes must be visited exactly once by defining the number of hops or levels traversed (minLevel and maxLevel) and with the uniqueness parameter. I know it is a lot to comprehend, and if you need some help, I would suggest asking questions on the Neo4j community site. We then select the path with the minimum sum of relationship weights as the solution. Because we calculate all the possible routes between the chosen monuments, this is a brute-force solution of the traveling salesman problem.

WITH $selection as selection
MATCH (c:Monument)
WHERE c.name in selection
WITH collect(c) as monuments
UNWIND monuments as c1
WITH c1,
[c in monuments where c.name > c1.name | c] as c2s,
monuments,
(size(monuments) - 1) as level
UNWIND c2s as c2
CALL apoc.path.expandConfig(c1, {
relationshipFilter: 'SHORTEST_ROUTE_TO',
minLevel: level,
maxLevel: level,
whitelistNodes: monuments,
terminatorNodes: [c2],
uniqueness: 'NODE_PATH'})
YIELD path
WITH path,
reduce(cost = 0, x in relationships(path) | cost + x.cost) as totalCost
ORDER BY totalCost LIMIT 1
WITH path,
totalCost,
apoc.coll.flatten([r in relationships(path) | r.shortestHopNodeIds]) as intermediate_stops,
[n in nodes(path) | id(n)] as node_ids
RETURN [n in nodes(path) | n.name] as path,
round(totalCost) as total_distance,
[optional in intermediate_stops where not optional in node_ids | gds.util.asNode(optional).name] as optional_stops

Results

Santa Claus solution

In the path column of the results, we have an ordered array of selected monuments to visit. Our travel would start with Castell de Sant Jaume and continue to Castell de Vilaüt and so on. We could dub this the Spanish castle-visiting trip as we selected six castles, and we have an option to see four more along the way. The total air distance of the path is 126 kilometers. Let’s visualize the results with our trip recommendation application.

Red markers are the selected monuments, and the blue markers are the optional stops along the way.

Conclusion

We have demonstrated most of the pathfinding algorithm available in the Neo4j Graph Data Science library with some real world use cases. The only puzzle left in this series is to finish the trip recommendation application. I have a plan to show off the application in the part three of the series. Till then, I encourage you to play around with various GDS library algorithm or try to recreate this series on a Neo4j sandbox instance. If you have any further questions, there are a bunch of Neo4j experts ready to help you on Neo4j community site.

As always, the code is available on GitHub.

--

--

Data explorer. Turn everything into a graph. Author of Graph algorithms for Data Science at Manning publication. http://mng.bz/GGVN