Beating Wikirace by Finding the Optimal Path with Neo4j and Dijkstra’s algorithm

Exploring Wikipedia dataset and applying graph data science algorithms

Peder Ward
Towards Data Science


Illustration of a Wikirace. Edited picture from Wikimedia Commons. CC BY-SA 3.0

Introduction

A few weeks ago, I came across an interesting game called Wikirace while watching some streamers play it. A Wikirace is a race between any number of participants, who use hyperlinks to travel from one Wikipedia page to another.

Normally, the participants start at a random Wikipedia page and choose a destination page. For example, they could start at the Wikipedia page for “Sweden” with “Dracula” as the destination. The goal is to reach the destination page, “Dracula”, from the source page, “Sweden”, using only the hyperlinks on the Wikipedia pages themselves. The first person to reach the destination page, or the person who reaches it using the fewest links, wins the race.

Although the game itself is interesting and fun, it was the optimal solution that caught my eye: what is the optimal path from a source page to a target page using only hyperlinks within Wikipedia? After analyzing the problem, I figured a graph database with shortest-path algorithms would be the best way to find the optimal paths. In this project I use the Neo4j graph database, and I will start this article with a quick introduction to graph databases.

Illustration of a graph database by Martin Grandjean. CC BY-SA 4.0

Graph Database

A graph database (GDB) is a database that uses graph structures for semantic queries, with nodes, edges, and properties to represent and store data. Using a GDB, I can structure the relations between all Wikipedia pages. In a GDB, a relation between two nodes can be represented like this:

(n:SourceNode)-[:RELATION]->(m:TargetNode)

For example, on the Wikipedia page of Sweden, shown below, the relations created by its hyperlinks can be structured as:

(n:Sweden)-[:HyperLink]->(m:`Nordic country`)
(n:Sweden)-[:HyperLink]->(m:Finland)
Snippet of text from the Wikipedia page of Sweden.

This relation is shown in a GDB as nodes and edges:

Nodes and edges making relations. Image by author

Dataset

Before I can try to solve this problem, I need all the hyperlink relations in Wikipedia. I reduce the amount of data by adding the constraint of only using English Wikipedia pages. Stanford University has a dataset called “English Wikipedia hyperlink network” by Jure Leskovec, which describes the relations between all English Wikipedia pages as of 2013. The folder includes two files: a .csv file with all node IDs and their names, and a .txt file that describes the relations between the different Wikipedia pages. Below is a sample from the relations .txt file:

1 2435 35
2 354 87
3 44 91120
4 492 5
5 …

Each line describes one relation. The first number is the ID of that particular relation, the second number is the source page, and the third number is the target page. For example, if you are on Wikipedia page 2435, there is a hyperlink on that page referring to Wikipedia page 35. I assume numeric IDs are used instead of full page names to keep the .txt file small. The .csv file mentioned above maps each ID to its Wikipedia page name; for example, the node with ID 35 is the Wikipedia page of “Sweden”.

To import the files into the Neo4j graph database I want to use the Neo4j Admin import tool. This tool is recommended when the dataset is huge, and in this case the dataset includes over 4 million nodes (Wikipedia pages) and 101 million edges (relations/hyperlinks).

Table from Stanford University by Jure Leskovec.

To prepare the files for Neo4j, I first converted the relations file from .txt to .csv in Python. My plan is to use the relationships to find the shortest path from one page to another with graph algorithms, and shortest-path algorithms need to know the distance between nodes. In this case, every node is at distance one from its neighbors, so I added a new column to the .csv file with a constant distance of one. I then cleaned the dataset and added the header required by the Neo4j Admin import tool. The two images below show samples from the relations and nodes .csv files, and a sketch of the conversion follows them.

Sample data from the nodes and relations .csv files.
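The conversion itself only takes a few lines of Python. Below is a minimal sketch of the step described above; the file names and exact column layout are assumptions, so adjust them to match your copy of the dataset.

import csv

# Convert the Stanford relations .txt file into a .csv file that the
# Neo4j Admin import tool understands. File names are assumptions.
with open('relations.txt') as src, open('relations.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    # Header for a relationship file: start node, end node,
    # relationship type, and the constant distance property.
    writer.writerow([':START_ID', ':END_ID', ':TYPE', 'distance:int'])
    for line in src:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip empty or malformed lines
        _, source_id, target_id = parts[:3]  # first column is the relation ID
        writer.writerow([source_id, target_id, 'RELATION', 1])  # distance = 1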

To upload the data to Neo4j GDB I used the Neo4j Admin import function with the following command:

bin/neo4j-admin import --database=neo4j --nodes=import/nodes.csv --relationships=import/relations.csv --multiline-fields=true

Make sure the database is empty before using this function.

After the import is done, I can visualize the relationships with the Neo4j Browser in Neo4j Desktop. First, I want to know how many other Wikipedia pages are linked from hyperlinks on the Wikipedia page of “Sweden”.

MATCH (n:WikiPage {name: 'Sweden'})-[r:RELATION]->(m) RETURN count(m)

count(m): 840

The result shows that hyperlinks on the “Sweden” page point to 840 other Wikipedia pages. I visualized 20 of these with the following query.

MATCH (n:WikiPage {name: 'Sweden'})-[r:RELATION]->(m) RETURN n, m LIMIT 21
Some Wikipedia pages that have a relation with Wikipedia page of “Sweden”.

The result shows that many of the pages directly linked from “Sweden” are also related to each other. For example, it is possible to go from the Wikipedia page “Sweden” to “Head of state” directly, with distance = 1, or via “Russia” or “Cuba”, with distance = 2. This can be confirmed on the original Wikipedia page of “Sweden”:

Copied text from Wikipedia page of “Sweden”.

In this case, a shortest-path algorithm would choose the path “Sweden” -> “Head of state”, since this path has the shortest distance.

Finding the shortest path

I can now finally use graph algorithms to find the shortest path between two nodes in this graph. First, I need to install the Graph Data Science (GDS) Library from the Plugins tab in Neo4j Desktop. The algorithms in the GDS Library require the graph to be projected first, which is done with the query below. More details about projecting a graph can be found here:

CALL gds.graph.project('myGraph', 'WikiPage', 'RELATION', {relationshipProperties: 'distance'})

To find the shortest path in the graph I choose to use Dijkstra’s algorithm.

Dijkstra’s algorithm is an algorithm for finding the shortest paths between nodes in a graph, which may represent, for example, road networks. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years later. [1]

I will not provide a detailed explanation of how Dijkstra’s algorithm works, but the animation below shows the main idea, and a good explanation can be found here. A minimal reference implementation follows the animation.

Animation of Dijkstra’s algorithm by Ibmua. Released in public domain.
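For readers who prefer code to animations, here is a minimal, textbook version of Dijkstra’s algorithm in plain Python, independent of Neo4j. The toy graph at the bottom mirrors the Wikipedia setting, where every hyperlink has weight one.

import heapq

def dijkstra(graph, source, target):
    # graph maps node -> {neighbor: weight}; returns (cost, path),
    # or (inf, []) if the target is unreachable.
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    visited = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            path = [node]
            while node != source:  # walk the predecessor chain back
                node = prev[node]
                path.append(node)
            return cost, path[::-1]
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < dist.get(neighbor, float('inf')):
                dist[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return float('inf'), []

# Toy example using the relations shown earlier; every hyperlink has weight 1.
graph = {
    'Sweden': {'Head of state': 1, 'Russia': 1, 'Cuba': 1},
    'Russia': {'Head of state': 1},
    'Cuba': {'Head of state': 1},
}
print(dijkstra(graph, 'Sweden', 'Head of state'))  # (1, ['Sweden', 'Head of state'])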

It’s now possible to run Dijkstra’s algorithm with the query below. To test the algorithm, I want to find the shortest path from “Sweden” to “Dracula”.

MATCH (source:WikiPage {name: 'Sweden'}), (target:WikiPage {name: 'Dracula'})
CALL gds.shortestPath.dijkstra.stream('myGraph', {
    sourceNode: source,
    targetNode: target,
    relationshipWeightProperty: 'distance'
})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN
    index,
    gds.util.asNode(sourceNode).name AS sourceNodeName,
    gds.util.asNode(targetNode).name AS targetNodeName,
    totalCost,
    [nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,
    costs,
    nodes(path) AS path
ORDER BY index

Result:
index: 0
sourceNodeName: “Sweden”
targetNodeName: “Dracula”
totalCost: 3.0
nodeNames: ["Sweden", "Industrial Revolution", "Lord Byron", "Dracula"]
costs: [0.0, 1.0, 2.0, 3.0]

The result shows that a minimum of three links is needed to get from “Sweden” to “Dracula”. The path can be visualized in a graph shown in the image below:

Shortest path from the Wikipedia page “Sweden” to “Dracula”.

Using Wikipedia.org I can confirm that this path exists, and I trust that the implemented Dijkstra’s algorithm has found the shortest path.

At thewikigame.com, people compete to find a path from a source page to a target page. Every 120 seconds a new source and target Wikipedia page are announced, and whoever gets from the source page to the target page fastest wins the race. I will not challenge these players with my code, as that would be considered cheating. Instead, I want to use results from played games to compare how close the contestants come to the optimal solutions found with Dijkstra’s algorithm. To do this, I wrote a short Python script, sketched after the results below, that takes the source and target pages from previous games and runs Dijkstra’s algorithm on them. The results are shown in the image below.

DataFrame showing the quickest path from thewikigame.com compared to the shortest path found by Dijkstra’s algorithm.
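The comparison script itself is short. The sketch below assumes a local Neo4j instance with the projected graph 'myGraph'; the URI, credentials, and the game list are placeholders, not values from my setup.

from neo4j import GraphDatabase  # pip install neo4j

QUERY = '''
MATCH (source:WikiPage {name: $source}), (target:WikiPage {name: $target})
CALL gds.shortestPath.dijkstra.stream('myGraph', {
    sourceNode: source, targetNode: target,
    relationshipWeightProperty: 'distance'
})
YIELD totalCost, nodeIds
RETURN totalCost, [nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS path
'''

# (source, target, length of the fastest human path) from played games.
games = [('Rush Limbaugh', 'The Animals', 6)]

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
with driver.session() as session:
    for source, target, fastest in games:
        record = session.run(QUERY, source=source, target=target).single()
        print(f'{source} -> {target}: human={fastest}, dijkstra={int(record["totalCost"])}')
driver.close()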

As the table shows, DijkstrasPath is always shorter than or equal to fastestPath, the fastest path found by a player on thewikigame.com. It is, however, impressive that some people have actually found paths equal in length to DijkstrasPath. Remember that there may be many different shortest paths between two pages, resulting in multiple optimal solutions. The players on thewikigame.com normally use 30 to 90 seconds to find their path, while Dijkstra’s algorithm uses 1 to 2 seconds. If we look at the path from “Rush Limbaugh” to “The Animals”, the fastest path in the Wikirace has length six, while Dijkstra’s algorithm found a shorter path of length four, as the image below shows.

Comparing path found by Dijkstra’s algorithm and the contest winner in a Wikirace.

This shows that it is possible to use graph algorithms, such as Dijkstra’s algorithm, to find the shortest path between two Wikipedia pages, and possibly win a Wikirace. One problem is that the dataset used in this project is from 2013. The outdated dataset may cause problems, since some paths from 2013 no longer exist today, and new links have been added since then.

Other graph algorithms in Neo4j GDS

While the graph is still projected, I want to do some more analysis on the Wikipedia dataset. For example: which Wikipedia pages are referred to most by other Wikipedia pages? I’m not a Wikirace expert, but I would guess that Wikipedia pages with many relations would be a great place to start a race.

Neo4j has a group of algorithms called Centrality algorithms.

Centrality algorithms are used to determine the importance of distinct nodes in a network. [2]

One specific centrality algorithm, PageRank, can help me find the most important nodes, or Wikipedia pages.

The PageRank algorithm measures the importance of each node within the graph, based on the number of incoming relationships and the importance of the corresponding source nodes. [3]

I can run the PageRank algorithm with this query:

CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC

The results are interesting. The picture below shows that “United States” is the most important node in the dataset, and the other most important nodes are also countries. Based on this analysis, I would recommend routing through a major country at the start of a Wikirace. Some further investigation shows that the “United States” page contains 938 hyperlinks, while 431,705 other Wikipedia pages have hyperlinks to it; the degree queries behind these numbers are sketched after the figure.

Most important nodes in the Wikipedia graph DB found by PageRank algorithm
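The in- and out-degree of a single page, such as “United States”, can be checked with two simple count queries. A sketch, again assuming a local instance with placeholder credentials:

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
with driver.session() as session:
    # Hyperlinks on the "United States" page (out-degree).
    out_links = session.run(
        "MATCH (:WikiPage {name: 'United States'})-[r:RELATION]->() RETURN count(r) AS c"
    ).single()["c"]
    # Hyperlinks pointing to the "United States" page (in-degree).
    in_links = session.run(
        "MATCH (:WikiPage {name: 'United States'})<-[r:RELATION]-() RETURN count(r) AS c"
    ).single()["c"]
print(out_links, in_links)  # 938 and 431705 on the 2013 dataset
driver.close()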

I can also explore the least important nodes by sorting the results in ascending order. The result shows many nodes with equally low scores, and looking at them in the graph, I can see they have only one relation each. Many of the Wikipedia pages with such low scores have been updated since 2013 and have more relations today than in this dataset.

Less important nodes in the Wikipedia graph DB found by PageRank algorithm

Further analysis

Some further analysis of random Wikipedia pages is visualized in the bar plot below. For each random page, the plot compares the number of hyperlinks referring to that page (in) with the number of hyperlinks from that page to other Wikipedia pages (out). A sketch of how such a plot can be produced follows the figure.

Comparing hyperlink references to and out from random Wikipedia pages.
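A plot like this can be produced with matplotlib, assuming the in/out counts have already been collected with degree queries like the ones shown above. The counts below are rough, illustrative placeholders, not exact values from the dataset (840 is Sweden’s out-degree from earlier).

import matplotlib.pyplot as plt

pages = ['Elon Musk', 'MacGyver', 'Sweden']
out_counts = [100, 200, 840]    # hyperlinks from each page (illustrative)
in_counts = [100, 1000, 10000]  # hyperlinks to each page (illustrative)

x = range(len(pages))
plt.bar([i - 0.2 for i in x], out_counts, width=0.4, label='out')
plt.bar([i + 0.2 for i in x], in_counts, width=0.4, label='in')
plt.xticks(list(x), pages)
plt.yscale('log')  # counts span several orders of magnitude
plt.ylabel('Number of hyperlinks')
plt.legend()
plt.show()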

For example, the Wikipedia page of “Elon Musk” has approximately 10² hyperlinks referring to other Wikipedia pages (out) and nearly the same number of hyperlinks referring to it (in). At first glance, it looks strange that “MacGyver” has more references on Wikipedia than “Elon Musk”, but as mentioned previously, the dataset is from 2013. Investigating further in Google Trends shows that the total number of Google searches is larger for “MacGyver” than for “Elon Musk”, although, as the graph shows, the trend for “Elon Musk” has been increasing since 2010.

Google Trend on Elon Musk and MacGyver in USA.

Conclusion

The goal of this project was to apply algorithms to find the optimal path in a Wikirace. In this article, I have shown that it is possible to find the optimal path using the Neo4j GDB, the Graph Data Science library, and Dijkstra’s algorithm on the Stanford University Wikipedia relation dataset. I then compared a few real wikiraces with the optimal paths found by Dijkstra’s algorithm. Dijkstra’s algorithm almost always found a strictly shorter path, and only a few of the wikiraces matched its path length. With these results, I conclude that the project has been successful and that I have found the optimal solutions. As previously mentioned, since the dataset is from 2013, this project can only guarantee optimal paths with respect to the Wikipedia pages of 2013.

Finally, I explored some other GDS algorithms and did some further analysis of the results. This project was made for educational purposes only and is not intended to be used in games such as Wikirace.

References

[1] — Dijkstra’s algorithm, Wikipedia
[2] — Centrality Algorithms, Neo4j
[3] — PageRank, Neo4j

LinkedIn Peder Ward
