
Knowledge graphs let us see how individual pieces of knowledge relate, and how they come together to form the larger picture of a field or topic. Constructing and visualising knowledge graphs can therefore be an effective approach in many fields.
In this article, we describe a process for generating new knowledge graphs by leveraging the largest publicly available graph of human knowledge: Wikipedia. We will fully automate the generation process with Python, giving us a scalable approach to generating knowledge graphs for any field of interest.
If you’d like to follow along, the end-to-end notebook is available here in Google Colab.
Approach
Our approach will be as follows:
- 🔌 Use the Wikipedia API to download information associated with a term
- 🔁 Iterate over many terms to build a knowledge base
- 🔝 Rank terms based on their ‘importance’
- 🌐 Visualise the knowledge graph using the networkx library
The Wikipedia API
Wikipedia makes all of its knowledge available via an API. On top of that, there is a great Python package, wikipedia, which makes it easy to query the site. With this package, we can fetch a Wikipedia page from a search term, as shown in the example below.
import wikipedia as wp

# Fetch the page that best matches the search term
ds = wp.page("data science")
You can read more about the package in this article.
The page object contains all the information we need to walk the graph and understand the relationship between various terms. The key properties to note with the object are:
- links: the outbound links that the page makes to other pages on Wikipedia
- content: the actual content of the page
- summary: the key content, shown at the top of the page.
An example from the Data science page is shown below.
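For instance, here is a minimal sketch of inspecting these properties (the comments describe what is printed; the values themselves depend on the live page):

import wikipedia as wp

ds = wp.page("data science")
print(ds.links[:5])       # first few outbound link titles
print(ds.summary[:200])   # opening of the page summary
print(len(ds.content))    # length of the full article text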
The Wikipedia website is massive, with 7M English articles (Wikipedia, 2022), which means that scanning every single page would be costly and would cover many pages irrelevant to our subject of interest. Therefore, we need to develop an algorithm that scans only the relevant pages.
Searching Wikipedia
The search algorithm should start at the point of interest and then explore out from there, making sure to stay close to the point of interest but also making sure to capture the most important pages.
The algorithm we’ll follow is:
- Start with a list of terms that cover the area of interest. For example, for a knowledge graph for "data science" we might choose "data science", "machine learning" and "artificial intelligence".
- Get the Wikipedia page from the terms on the list using the Wikipedia API.
- Find all outbound links on the page, and calculate a weight for each. The weight can be based on how often the term appears, how close it appears to the start of the document, or whether it is included in the summary.
- Add the new links to the list of terms.
- Find the most important term from the remaining terms and get the page for that term. We can define importance as the number of times the term has been referenced by other pages, combined with the weights of those references.
- Repeat steps 3–5 until sufficient depth has been reached. For the examples that follow, this was on the order of hundreds of terms.
With this, we can begin to build up a local subgraph of Wikipedia focused on the subject we care about.
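As a rough illustration, a minimal sketch of this crawl loop might look like the following. The weighting scheme here (links mentioned in the summary count double) and the function name are assumptions for illustration, not the exact scheme used in the notebook:

import wikipedia as wp
from collections import defaultdict

def crawl(seed_terms, max_pages=100):
    scores = defaultdict(float)                 # term -> accumulated importance
    frontier = {t.lower(): 1.0 for t in seed_terms}
    visited = set()
    for _ in range(max_pages):
        if not frontier:
            break
        term = max(frontier, key=frontier.get)  # most important unvisited term
        frontier.pop(term)
        visited.add(term)
        try:
            page = wp.page(term, auto_suggest=False)
        except (wp.exceptions.DisambiguationError, wp.exceptions.PageError):
            continue                            # skip ambiguous or missing pages
        summary = page.summary.lower()
        for link in page.links:
            link = link.lower()
            weight = 2.0 if link in summary else 1.0  # assumed weighting scheme
            scores[link] += weight
            if link not in visited:
                frontier[link] = scores[link]
    return scores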
These terms can then be presented as a list ordered by importance. An example of this for "data science" is shown below.
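Using the crawl sketch from earlier, producing such a ranking could look like this (the seed terms and output format are illustrative):

scores = crawl(["data science", "machine learning", "artificial intelligence"])
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for term, score in top[:10]:
    print(f"{score:7.1f}  {term}")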

A list is helpful to work through, but we’ve got a lot more data here we could utilise, so let’s explore network plots.
Creating Network Plots
With a network defined, we can begin to visualise it. Given the graphical nature of the data, it’s best viewed as a graph. For this, we can use the handy package networkx. Networkx is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (Networkx, 2022).
Networkx builds on top of basic graph theory to construct graphs. An example plotting script is shown below.
import networkx as nx
import matplotlib.pyplot as plt

# Example terms and the connections between them
nodes = ["data science", "machine learning", "statistics"]
edges = [("data science", "machine learning"), ("data science", "statistics")]

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
nx.draw(G, with_labels=True)
plt.show()
To plot the networks, we’ll need somewhat more complex calls than this example shows. In particular, we will use weighted nodes and weighted edges, based on the importance of individual terms and the strength of their connections, respectively.
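A minimal sketch of such a weighted plot is shown below; the importances and link weights are toy values standing in for real crawl output:

import networkx as nx
import matplotlib.pyplot as plt

# Toy importances and link weights standing in for real crawl output
scores = {"biology": 10.0, "animal": 8.0, "cell wall": 3.0, "meiosis": 2.5}
links = {("biology", "animal"): 4.0,
         ("biology", "cell wall"): 2.0,
         ("cell wall", "meiosis"): 1.5}

G = nx.Graph()
for (a, b), w in links.items():
    G.add_edge(a, b, weight=w)

pos = nx.spring_layout(G, weight="weight", seed=42)  # strongly linked terms sit closer
nx.draw(G, pos,
        node_size=[100 * scores[n] for n in G.nodes],   # node size ~ importance
        width=[G[u][v]["weight"] for u, v in G.edges],  # edge width ~ link weight
        with_labels=True, font_size=8)
plt.show()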
Plots for "data science", "physics" and "biology" are shown below.



Looking at the field of Biology, we see an interesting graph. I’m no biologist, but this seems pretty accurate! Some points of interest:
- The term animal is closely placed next to biology and has a similar importance, which makes sense given biology is the study of living organisms.
- On the left side we see a cluster of cell-related biology: amoeba, cell wall, meiosis, and bacteria. The networkx layout algorithm groups related terms together due to their strong links. In this way, the knowledge graph can suggest terms to study together.
- Given the strong link between biology and its environment, we see the field of geology showing up through terms such as earth and stratigraphy.
- As one might expect from recent events, we see that issues around climate, climate change and time are elevated as important topics. This might not be the case if this was a knowledge graph from 20 years ago.
- We have the seemingly out-of-scope terms 1, number and isbn showing up. This is likely an artifact of citation boilerplate within Wikipedia pages, and such terms should be filtered out.
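One possible cleanup, using an assumed blocklist of citation boilerplate terms:

# Assumed blocklist of citation boilerplate; extend as needed
BLOCKLIST = {"isbn", "number", "doi", "issn"}
scores = {term: s for term, s in scores.items()
          if term not in BLOCKLIST and not term.isdigit()}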
Making Education Easier
Here we’ve presented a methodology for going from an area of interest to a full-blown knowledge graph. It allows us to get a list of terms ranked by importance, as well as a visualisation of how these all fit together.
In this way, these graphs can be useful for education and for learning a new area. This is true for personal study, but also for tuition and broader education, where curricula often leave gaps in knowledge.
Reminder: the end-to-end notebook is available here in Google Colab so please feel free to try this out for yourself and let me know what you find!