
How to Create Representations of Entities in a Knowledge Graph using pyRDF2Vec

A tutorial on how to tackle downstream ML tasks on data represented in a Knowledge Graph.

Gilles Vandewiele
Towards Data Science
11 min read · Nov 2, 2020


Representing Data With Knowledge Graphs

Graphs are data structures that are useful to represent ubiquitous phenomena, such as social networks, chemical molecules and recommendation systems. One of their strengths lies in the fact that they explicitly model relations (i.e. edges) between individual units (i.e. nodes), which adds an extra dimension to the data.

We can illustrate the added value of this data enrichment using the Cora citation network. This dataset contains a bag-of-words representation for a few thousand papers, together with the citation relations between these papers. If we apply dimensionality reduction (t-SNE) to create a 2D plot of the bag-of-words representations (Figure 1, left), we can see clusters arise (colored according to their research topic), but they overlap. If we instead produce an embedding with a graph network that takes the citation information into account (Figure 1, right), the clusters are better separated.

Figure 1: left: A t-SNE embedding of the bag-of-words representations of each paper. right: An embedding produced by a graph network that takes into account the citations between papers. source: “Deep Graph Infomax” by Velickovic et al.

Knowledge Graphs (KG) are a specific type of graph. They are multi-relational (i.e. there are different edges for different types of relations) and directed (i.e. the relations have a subject and object). These properties make it possible to represent information from heterogeneous sources in a uniform format. We can convert a Knowledge Graph to a regular directed graph, which facilitates further analysis, as shown in Figure 2.

Figure 2: Converting a multi-relational directed KG into a regular directed graph. Image by author.

Running Example: Countries in DBpedia

We will use a running example throughout this post. Let’s start by focusing on creating representations for a few randomly chosen countries from all over the world. We will extract information for each of the countries from DBpedia, a large general-purpose KG that is created from Wikipedia.

Let’s take a look at how the KG looks in the neighbourhood of a specific country: 🇧🇪 Belgium 🇧🇪. This process is analogous to going to its corresponding DBpedia page and then recursively clicking on all the links on that page. We depict this below in Figure 3. Expanding this neighbourhood iteratively quickly becomes complex, even though we already simplified the picture by leaving out some parts. Nevertheless, we see that DBpedia contains some useful information about Belgium (e.g., its national anthem, largest city, currency, …).

We created a custom dataset with country information by using the DBpedia SPARQL endpoint. We retrieved a list of countries from the University of Mannheim “Semantic Web for Machine Learning” repository. For each country, we have information regarding its inflation and academic output. This information is binarized into “high” and “low” (giving us two binary classification tasks). Moreover, for each of the countries we retrieved their continent (Europe, Asia, Americas, Africa or Oceania), which gives us a 5-class classification task. The KG with the information about these countries is a subset of DBpedia: for each country, we retrieved all information by expanding the KG three times. This process corresponds exactly to what is depicted in Figure 3. Due to a rate limitation placed on the SPARQL endpoint, only a maximum of 10000 nodes at depth 3 and their parents are included. The KG (in Turtle syntax) can be downloaded here, and the CSV files with the list of countries and their labels can be downloaded here.

Figure 3: Recursively expanding the knowledge graph makes things complex quickly. Image by author.

Creating Entity Embeddings With RDF2Vec

RDF2Vec stands for Resource Description Framework To Vector. It is an unsupervised, task-agnostic algorithm to numerically represent nodes in a KG, allowing them to be used for further (downstream) machine learning tasks. RDF2Vec builds on top of existing natural language processing techniques: it combines insights from DeepWalk and Word2Vec. Word2Vec is able to generate embeddings for each word in a provided collection of sentences (often called a corpus). To generate a corpus for a KG, we extract walks. Extracting a walk is similar to visiting the DBpedia page of an entity and clicking on links; the number of clicks you make is equivalent to the number of hops in the walk. An example of such a walk, again for Belgium, would be: Belgium -> dbo:capital -> City of Brussels -> dbo:mayor -> Yvan Mayeur. Note that we make no distinction between predicates/properties (e.g., dbo:capital and dbo:mayor) and entities (e.g., Belgium, Brussels, Yvan Mayeur, …) in our walks, as explained in Figure 2. Each walk can now be seen as a sentence, and the hops in that walk correspond to the tokens (words) of that sentence. Once we have extracted a large number of walks rooted at the entities we want to create embeddings for, we can provide them as a corpus to Word2Vec. Word2Vec will then learn an embedding for each unique hop, which can then be used for ML tasks.
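To make this analogy concrete, here is a tiny, purely illustrative sketch (not pyRDF2Vec’s internals) that feeds a toy set of walks directly to gensim’s Word2Vec as if they were sentences; the walks and hyper-parameter values are made up for illustration:

from gensim.models import Word2Vec  # the NLP model that RDF2Vec builds on

# Toy corpus: every walk is a "sentence" and every hop is a "token".
walks = [
    ["dbr:Belgium", "dbo:capital", "dbr:City_of_Brussels", "dbo:mayor", "dbr:Yvan_Mayeur"],
    ["dbr:Belgium", "dbo:currency", "dbr:Euro"],
]

# Train Word2Vec on the walks exactly as if they were natural-language sentences.
# (Older gensim versions use `size` instead of `vector_size`.)
model = Word2Vec(walks, vector_size=100, window=5, min_count=1, sg=1)

belgium_embedding = model.wv["dbr:Belgium"]  # 100-dimensional vector for Belgium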

Introducing pyRDF2Vec

pyRDF2Vec is a repository that contains a Python implementation of the RDF2Vec algorithm. On top of the original algorithm, different extensions are implemented as well.

In this tutorial, I will explain how we can generate embeddings of entities in a KG using pyRDF2Vec. Moreover, I will briefly explain some of the extensions to RDF2Vec and how those can be used within pyRDF2Vec.

Image taken from our repository.

Biased Walking or Sampling Strategies

Now, as can be noticed from Figure 3, the number of possible walks that we can extract grows exponentially as a function of the depth. This becomes problematic when we are using large KGs, such as DBpedia. In the original RDF2Vec paper, walks were simply sampled randomly from the graph, but Cochez et al. proposed several metrics to bias the walks. These walks were then called biased walks, but we will refer to them as sampling strategies, in line with the pyRDF2Vec terminology. An example of a possible sampling strategy is shown in Figure 4.

Figure 4: One possible way to sample the next hop in a walk would be to scale the weights as a function of the number of outgoing edges. This is only one example; there are many different metrics (based on frequency, PageRank, degrees, …). Image by author.

Walk Modifications and Transformations

Up until now, we explained the walking algorithm as continuously sampling from the neighbours of a node until a certain depth is reached. However, we can make modifications to this algorithm (extraction algorithms) or we can apply post-processing on the walks to incorporate extra information (transformation algorithms). This is what we have been researching at IDLab. These strategies, applied on a simple example, can be seen in Figure 5.

Figure 5: Different strategies to extract walks (depth 2) from a graph. In this figure, A and F are the roots from which we want to extract walks. C and H belong to the same community (community detection). source: “Walk Extraction Strategies for Node Embeddings with RDF2Vec in Knowledge Graphs” by Vandewiele et al. (blog author).

With these modified walking strategies introduced, we have now discussed each of the three main building blocks of the RDF2Vec algorithm: (i) a walking strategy, (ii) a sampling strategy, and (iii) an embedding algorithm (NLP). As we will discuss further, each of these building blocks is configurable in pyRDF2Vec.

Loading KGs With pyRDF2Vec

KGs are often represented in the Resource Description Framework (RDF) format. pyRDF2Vec can easily load files in different RDF syntaxes by wrapping around rdflib. This loads the entire KG into RAM, which becomes problematic when the KG is larger than the available RAM. We therefore also support interaction with endpoints: the KG can be hosted on a server and our KG object will interact with that endpoint whenever necessary. This drastically reduces the required RAM at the cost of higher latency.

Loading the metadata and a Knowledge Graph with pyRDF2Vec
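A minimal sketch of what this could look like. The file names, separator and CSV column names below are assumptions, and the exact KG constructor arguments differ slightly between pyRDF2Vec versions:

import pandas as pd
from pyrdf2vec.graphs import KG

# Load the metadata: the list of country entities (DBpedia URIs) and their labels.
# (File name, separator and column names are assumptions about the CSV layout.)
data = pd.read_csv("countries.csv", sep="\t")
entities = list(data["Country"])
labels = data["Label"]

# Option 1: load the Turtle file entirely into RAM.
# Depending on the pyRDF2Vec version you may need to pass the RDF format explicitly.
kg = KG("countries.ttl")

# Option 2: let the KG object talk to a (local or public) SPARQL endpoint instead,
# which avoids loading everything into RAM at the cost of higher latency.
# (Older versions require an explicit is_remote=True flag.)
# kg = KG("https://dbpedia.org/sparql")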

Creating Our First Embeddings

Now that we have our KG loaded in memory, we can start creating embeddings! In order to do this, we create an RDF2VecTransformer and then call its fit() function with the freshly loaded KG and a list of entities. Once the model is fitted, we can retrieve the embeddings through the transform() function. One thing that differs from the regular scikit-learn flow (where we call fit() on the train data and predict() or transform() on the test data) is that both the train and test entities have to be provided to the fit() function (similar to how t-SNE works in scikit-learn). Since RDF2Vec is unsupervised, this does not introduce label leakage.

Creating our initial embeddings with default hyper-parameters. We get a 100-dimensional embedding for each of the provided entities.
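A sketch of this step, using the fit()/transform() flow described above; note that in some pyRDF2Vec versions transform() also expects the KG as its first argument (or you can call fit_transform()):

from pyrdf2vec import RDF2VecTransformer

# Default hyper-parameters: random walks embedded by Word2Vec into 100 dimensions.
transformer = RDF2VecTransformer()

# Both train and test entities go into fit(); since RDF2Vec is unsupervised,
# this does not leak any label information.
transformer.fit(kg, entities)
walk_embeddings = transformer.transform(entities)

print(len(walk_embeddings), len(walk_embeddings[0]))  # number_of_entities x 100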

The code snippet above gives us a list of lists: for each of the entities provided to the transform method, a 100-dimensional embedding is returned. In order to inspect these embeddings with the human eye, we need to further reduce their dimensionality. One great technique for this is t-SNE. We can do so with the following code snippet:

Further reducing the dimensionality of our embeddings from 100D to 2D with t-SNE in order to visualize them.
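A possible way to do this with scikit-learn and matplotlib; coloring the points by one of the labels (here the continent) is an assumption about how Figure 6 was produced:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import TSNE

# Reduce the 100-dimensional embeddings to 2 dimensions for plotting.
walk_tsne = TSNE(n_components=2, random_state=42).fit_transform(walk_embeddings)

plt.figure(figsize=(10, 8))
plt.scatter(walk_tsne[:, 0], walk_tsne[:, 1], c=pd.factorize(labels)[0])
for (x, y), entity in zip(walk_tsne, entities):
    plt.annotate(entity.split("/")[-1], (x, y))  # label each point with its country
plt.show()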

Which gives us the result shown in Figure 6.

Figure 6: a t-SNE plot of our initial embeddings. We can start to see clusters of countries arise. Image by Author.

Now let’s take a look at how well these embeddings can solve the three ML tasks we discussed: two binary classification tasks (high/low inflation and high/low academic output) and a multi-class classification task (predicting the continent). It should be noted that, since RDF2Vec is unsupervised, this label information was never used during the creation of the embeddings! RDF2Vec is task-agnostic: the projection from our nodes to an embedding is not tailored towards a specific task, and the embeddings can be used for multiple different downstream tasks. Let’s create a utility function that takes the produced embeddings as input and then performs classification for all three tasks:

Fitting a classifier on the embeddings for three different tasks.
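A sketch of such a utility function. The post mentions a Random Forest classifier; the label column names and the train/test split below are assumptions, so the exact numbers you get may differ from the ones reported here:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

def classify(embeddings):
    """Fit a Random Forest on the embeddings for each of the three tasks."""
    for task in ["Research Rating", "Inflation Rating", "Continent"]:  # assumed column names
        X_train, X_test, y_train, y_test = train_test_split(
            embeddings, data[task], test_size=0.33, random_state=42
        )
        clf = RandomForestClassifier(random_state=42)
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)

        print(task)
        print("Accuracy =", accuracy_score(y_test, predictions))
        print(confusion_matrix(y_test, predictions))
        print()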

Now let’s call classify(walk_embeddings) to take a look at the performance of our baseline embeddings:

Research Rating
Accuracy = 0.765625
[[26 8]
[ 7 23]]

Inflation Rating
Accuracy = 0.5882352941176471
[[14 16]
[12 26]]

Continent
Accuracy = 0.6716417910447762
[[13 3 1 1 0]
[ 4 11 0 1 0]
[ 0 1 7 3 0]
[ 1 3 1 14 0]
[ 0 1 0 2 0]]

The accuracies for the research rating, inflation and continent classification are respectively equal to 76.56%, 58.82% and 67.16%. While these accuracies are far from perfect, they do show that some information for all of these tasks is present in the generated embeddings. One possible reason for the rather low accuracies is that only a subset of the data from DBpedia was used, due to rate limitations of its public API. Moreover, we only used default hyper-parameters to generate our embeddings.

Tuning the Hyper-Parameters

As mentioned before, each of the three building blocks of the RDF2Vec algorithm (walking algorithm, sampling strategy and embedding technique) is configurable. For now, let’s try to extract deeper walks and generate larger embeddings:

Setting a different walk depth and embedding size.
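A sketch of this configuration, matching the description in the next paragraph (random walks of depth 3, exhaustive extraction, 500-dimensional Word2Vec); the embedding-size keyword depends on your gensim version:

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import RandomWalker

# Random walks of depth 3, extracted exhaustively (None = no limit on the number
# of walks per entity), embedded by Word2Vec into 500 dimensions.
transformer = RDF2VecTransformer(
    walkers=[RandomWalker(3, None)],
    embedder=Word2Vec(size=500),  # vector_size=500 with newer gensim versions
)
transformer.fit(kg, entities)
walk_embeddings = transformer.transform(entities)
classify(walk_embeddings)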

We specify that we want to use the random walking strategy to extract walks with a depth of 3; this corresponds to following 3 links from a certain DBpedia page, or taking 6 hops in our converted KG (see Figure 2). We also specify that we want to exhaustively extract all possible walks of depth 3 for each entity, indicated by the None argument. We do not specify any sampling strategy, but do specify that we want to use the Word2Vec embedding technique to produce embeddings of size 500. This hyper-parameter configuration gives us the following accuracies:

Research Rating
Accuracy = 0.78125
[[25 9]
[ 5 25]]

Inflation Rating
Accuracy = 0.6470588235294118
[[14 16]
[ 8 30]]

Continent
Accuracy = 0.6865671641791045
[[15 0 0 3 0]
[ 6 9 0 1 0]
[ 1 2 6 2 0]
[ 0 3 0 16 0]
[ 0 0 0 3 0]]

As we can see, the accuracy has improved for two of the three tasks while remaining the same for the third task (inflation classification).

Trying Different Walking Strategies

pyRDF2Vec allows us to use different walking strategies; an overview of the different strategies is provided in Figure 5. Moreover, we can combine different strategies: pyRDF2Vec will extract walks with each strategy and concatenate the extracted walks before providing them to the embedding technique. Let’s try combining several walking strategies:
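The post does not spell out which exact combination was used, so the sketch below is just one plausible choice: a random walker combined with a Walklet and a community-based walker (the latter requires the python-louvain package):

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import CommunityWalker, RandomWalker, WalkletWalker

# pyRDF2Vec extracts walks with every walker in the list and concatenates them
# into a single corpus before training the embedding model.
transformer = RDF2VecTransformer(
    walkers=[
        RandomWalker(3, None),
        WalkletWalker(3, None),
        CommunityWalker(3, None),
    ],
    embedder=Word2Vec(size=500),
)
transformer.fit(kg, entities)
walk_embeddings = transformer.transform(entities)
classify(walk_embeddings)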

The accuracies we now get are:

Research Rating
Accuracy = 0.71875
[[24 10]
[ 8 22]]

Inflation Rating

Accuracy = 0.6764705882352942
[[14 16]
[ 6 32]]

Continent
Accuracy = 0.7910447761194029
[[15 0 0 3 0]
[ 5 11 0 0 0]
[ 2 0 9 0 0]
[ 0 1 0 18 0]
[ 0 1 0 2 0]]

So we see an improvement for the Inflation Rating and Continent tasks, but a decrease in performance for the Research Rating.

Sampling Deeper Walks

Now if we want to extract deeper walks, we quickly run into memory issues, since the number of walks grows exponentially as a function of the walk depth. This is where sampling strategies come into play. As a final experiment, let’s sample 5000 walks of depth 6 using the walking strategies from the previous section:
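A sketch of this final experiment. The post does not state which sampling strategy was used, so the PageRankSampler below is just one of the available options (UniformSampler and several frequency-based samplers also exist), and the walker combination mirrors the illustrative one from the previous section:

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.samplers import PageRankSampler
from pyrdf2vec.walkers import CommunityWalker, RandomWalker, WalkletWalker

# Instead of exhaustively enumerating all walks (which explodes at depth 6),
# we sample at most 5000 walks per entity, biased by a sampling strategy.
transformer = RDF2VecTransformer(
    walkers=[
        RandomWalker(6, 5000, sampler=PageRankSampler()),
        WalkletWalker(6, 5000, sampler=PageRankSampler()),
        CommunityWalker(6, 5000, sampler=PageRankSampler()),
    ],
    embedder=Word2Vec(size=500),
)
transformer.fit(kg, entities)
walk_embeddings = transformer.transform(entities)
classify(walk_embeddings)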

In total, 854901 walks were extracted. This gives us the following accuracies:

Research Rating
Accuracy = 0.671875
[[30 4]
[17 13]]

Inflation Rating
Accuracy = 0.5
[[14 16]
[18 20]]

Continent
Accuracy = 0.8059701492537313
[[17 0 0 1 0]
[ 4 10 0 2 0]
[ 0 2 9 0 0]
[ 0 1 0 18 0]
[ 0 3 0 0 0]]

So we only see an improvement for the continent classification, and a significant deterioration for the two other tasks. Of course, one could tune the many hyper-parameters of the RDF2VecTransformer (the walking & sampling strategies and their corresponding hyper-parameters) and the Random Forest (or any other classification technique) to obtain optimal accuracies, but we leave that as an exercise for the reader!

Shortcomings of RDF2Vec and Research Challenges

There are a few shortcomings in the current version of RDF2Vec that require further research to tackle:
– The original RDF2Vec implementation does not incorporate the most recent insights from the NLP domain: since the inception of RDF2Vec in 2017, many advancements have been made in the NLP domain. We are currently working on implementing different NLP embedding techniques (such as BERT) in pyRDF2Vec.
– The expressiveness of embeddings produced by random walks can be limited: since walks are only single chains, the information they capture is somewhat limited. The walking strategies try to alleviate this shortcoming, but further research in this direction is needed. Moreover, we could probably combine walks generated by different strategies.
– RDF2Vec does not scale to large KGs: as the number of possible walks that can be extracted grows exponentially with depth, RDF2Vec does not scale well to KGs with a large number of nodes, especially when the KG contains many highly-connected nodes. The sampling strategies improve scalability, but more research can be done.
– RDF2Vec cannot deal well with numerical values in the KG: currently, all hops in the walks, which correspond to nodes from the KG, are handled as categorical data. This is sub-optimal for ordinal data (e.g., the number of inhabitants or the size of a country).
– RDF2Vec cannot deal with volatile data: as mentioned, both the train and test data need to be provided to the fit() method. But what if some of our test data is not yet available? There are techniques that can help here, such as online and incremental learning.

We hope, by releasing pyRDF2Vec, to provide a toolkit that can facilitate research to tackle these challenges.

Code and Data Availability

The pyRDF2Vec repository can be found on GitHub. Feel free to give us a star if you like the repository; it is greatly appreciated! Moreover, we welcome all kinds of contributions.

The custom dataset used throughout this blog post can be downloaded: KG and CSV. We also provide all the code in a Google Colab notebook so you can run it yourself interactively from your browser!
