
Before you start with this article, I assume you have a basic idea of knowledge graph embeddings. If not, I suggest you check the following article, which gives a good overview of the concept:
Now with this out of the way, in this article, I will show how you can generate Knowledge Graph embeddings, interpret them and also evaluate their performance on graph-based tasks.
Knowledge graph
Consider a knowledge graph of countries. In this graph, we have the names of countries and regions as entities. The relations are represented by two properties, namely, "neighbor" and "locatedin".
The graph is a built-in dataset of the PyKEEN Python library.
Check the following screenshot of the knowledge graph triples shown as a data frame:

The triples are in the form (croatia, neighbor, serbia) and (denmark, locatedin, europe).
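To make the (head, relation, tail) form concrete, here is a minimal sketch that stores a few triples as plain Python tuples and indexes them by head and relation. The helper name `index_by_head` is my own for illustration and is not part of PyKEEN.

```python
from collections import defaultdict

# A few (head, relation, tail) triples taken from the examples in this article.
triples = [
    ("croatia", "neighbor", "serbia"),
    ("denmark", "locatedin", "europe"),
    ("mexico", "neighbor", "united_states"),
]

def index_by_head(triples):
    """Group tails by head entity and relation (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for head, relation, tail in triples:
        index[head][relation].append(tail)
    return index

index = index_by_head(triples)
print(index["croatia"]["neighbor"])
print(index["denmark"]["locatedin"])
```

Each entity becomes a node and each triple a labeled, directed edge, which is how the graph visualizations in this article are built.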
Below is a full visualization of the graph using the NetworkX Python library:

The graph is big, so the visualization is not too clear. Let’s check a subgraph of this knowledge graph.

The subgraph above clearly illustrates the triples, showing both the entities and relations.
Now we have a knowledge graph. Let’s understand how to generate the knowledge graph embeddings using the PyKEEN Python library.
Knowledge graph embeddings

The figure above visualizes the knowledge graph embeddings trained with the TransR algorithm. Observe how countries cluster together. At the top of the visualization, Bosnia, Montenegro, and Albania sit close to each other. This indicates that the embedding algorithm was able to group countries with similar properties.
The visualization is made using PCA, which reduces the 128-dimensional embeddings to 2 dimensions. This reduction discards information, so the 2D plot may not faithfully reflect how the countries are grouped in the full embedding space. The embedding algorithm can also be tuned to improve the performance.
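To illustrate what the PCA step does, below is a dependency-free sketch that projects a handful of vectors onto their top two principal components using power iteration. The 3-dimensional vectors are invented toy values, not the trained 128-dimensional embeddings, and the function `pca_2d` is a hypothetical helper. In practice, a library routine such as `sklearn.decomposition.PCA` would be the usual choice.

```python
import math
import random

def pca_2d(vectors, iterations=200, seed=0):
    """Project row vectors onto their top-2 principal components (power iteration)."""
    n, d = len(vectors), len(vectors[0])
    # Center every dimension around zero.
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.uniform(-1, 1) for _ in range(d)]
        for _ in range(iterations):
            # Multiply by the covariance matrix.
            w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
            # Stay orthogonal to the components already found.
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w))
            w = [a / norm for a in w]
        components.append(w)
    # Coordinates of each centered vector along the two components.
    return [[sum(a * b for a, b in zip(row, c)) for c in components] for row in centered]

# Toy stand-ins for trained embeddings (values are made up for illustration).
toy_embeddings = {
    "albania": [1.0, 0.9, 0.1],
    "montenegro": [0.9, 1.0, 0.2],
    "denmark": [-1.0, 0.1, 0.9],
    "mexico": [0.1, -1.0, 0.8],
}
labels = list(toy_embeddings)
points = dict(zip(labels, pca_2d([toy_embeddings[label] for label in labels])))
for label, (x, y) in points.items():
    print(f"{label}: ({x:.2f}, {y:.2f})")
```

Because the projection keeps only two directions, distances in the plot can only be equal to or smaller than distances in the original space, which is why nearby points in 2D are a hint rather than proof of similarity.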
Here is the code to generate the embeddings:
Prerequisites
pip install pykeen -q # install PyKEEN library
Read the data
from pykeen.datasets import Countries
import pandas as pd
# read data from pykeen dataset method
df = pd.DataFrame(Countries().training.triples)
df.columns = ['h', 'r', 't']
df.sample(10)
Generate the embeddings
from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline
# Generate triples from the graph data
tf = TriplesFactory.from_labeled_triples(df.values)
# split triples into train and test
training, testing = tf.split([0.8, 0.2], random_state=42)
# generate embeddings using PyKEEN's pipeline method
result = pipeline(
    training=training,
    testing=testing,
    model="TransR",
    model_kwargs=dict(embedding_dim=128),
    training_kwargs=dict(num_epochs=200),
    random_seed=42,
)
In the code above, the TriplesFactory class provides a standardized representation of the triples in the knowledge graph. PyKEEN offers several methods to read and manipulate triples, and to split the data into training, validation, and test sets. Here, I split the data into train and test sets only, as I am not performing any hyperparameter tuning.
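To show what an 80/20 split of triples involves, here is a simplified, dependency-free sketch. It keeps every entity and relation in the training portion so that each one can receive an embedding. This is my own illustration under that assumption, not PyKEEN's implementation, and `split_triples` and the toy triples are hypothetical.

```python
import random

def split_triples(triples, train_ratio=0.8, seed=42):
    """Split triples so every entity and relation appears in training (illustrative)."""
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)

    train, rest, seen = [], [], set()
    # First pass: a triple that introduces a new entity or relation must go to training.
    for h, r, t in shuffled:
        keys = {("entity", h), ("entity", t), ("relation", r)}
        if keys <= seen:
            rest.append((h, r, t))
        else:
            train.append((h, r, t))
            seen |= keys
    # Second pass: top up the training set to the requested ratio.
    target = max(len(train), round(train_ratio * len(shuffled)))
    extra = target - len(train)
    train.extend(rest[:extra])
    return train, rest[extra:]

# Toy triples built from entities mentioned in this article.
toy_triples = [
    ("croatia", "neighbor", "serbia"), ("serbia", "neighbor", "croatia"),
    ("mexico", "neighbor", "united_states"), ("united_states", "neighbor", "mexico"),
    ("mexico", "neighbor", "belize"), ("belize", "neighbor", "mexico"),
    ("mexico", "neighbor", "guatemala"), ("guatemala", "neighbor", "mexico"),
    ("bhutan", "neighbor", "india"), ("india", "neighbor", "bhutan"),
    ("bhutan", "neighbor", "china"), ("china", "neighbor", "bhutan"),
    ("denmark", "locatedin", "europe"),
]
train, test = split_triples(toy_triples)
print(len(train), len(test))
```

Keeping all entities in training matters because an entity that only appears in the test set would have no trained embedding to score.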
I selected the TransR model and set the embedding dimension to 128 and the number of epochs to 200.
The Loss vs. Epochs plot looks as follows:

The loss decreases until 100 epochs and then remains almost constant. 100 epochs would have been fine for this training process. Oh well!
Retrieve the embeddings
The training process is done, and we can now retrieve the embeddings from the result object returned by PyKEEN’s pipeline() function. The model learns embeddings for both entities and relations. These embeddings can be visualized using any dimensionality reduction method, such as PCA or t-SNE (as in the visualization shown in the section above).
Below is the code to get embeddings after the training process is done:
Entity embeddings
import torch
# get entity labels from training set
entity_labels = training.entity_labeling.all_labels()
# convert entities to ids
entity_ids = torch.as_tensor(training.entities_to_ids(entity_labels))
# retrieve the embeddings using entity ids
entity_embeddings = result.model.entity_representations[0](indices=entity_ids)
# create a dictionary of entity labels and embeddings
# (.cpu() is a no-op on CPU and is needed before .numpy() if the tensor is on a GPU)
entity_embeddings_dict = dict(zip(entity_labels, entity_embeddings.detach().cpu().numpy()))
Relation embeddings
# get relation labels from training set
relation_labels = training.relation_labeling.all_labels()
# convert relations to ids
relation_ids = torch.as_tensor(training.relations_to_ids(relation_labels))
# retrieve the embeddings using relation ids
relation_embeddings = result.model.relation_representations[0](indices=relation_ids)
# create a dictionary of relation labels and embeddings
relation_embeddings_dict = dict(zip(relation_labels, relation_embeddings.detach().cpu().numpy()))
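Once the embeddings sit in a label-to-vector dictionary, a quick sanity check is to look up the nearest neighbors of an entity. The sketch below does this with cosine similarity on a toy dictionary shaped like entity_embeddings_dict. The vectors are invented, and `cosine` and `nearest` are hypothetical helpers of mine.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(label, embeddings, k=2):
    """Return the k labels most similar to `label`, with their similarity."""
    query = embeddings[label]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != label]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy stand-in for entity_embeddings_dict (made-up 3-dimensional vectors).
toy_embeddings = {
    "albania": [1.0, 0.9, 0.1],
    "montenegro": [0.9, 1.0, 0.2],
    "denmark": [-1.0, 0.1, 0.9],
    "mexico": [0.1, -1.0, 0.8],
}
print(nearest("albania", toy_embeddings, k=1))
```

With the real dictionary, the same call should work unchanged, since NumPy vectors can be iterated like lists. Unlike the 2D plot, this comparison uses all embedding dimensions.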
Evaluation of embeddings
We have seen the generation of embeddings and also how to retrieve them. Now let’s check their performance on link prediction. Link prediction is a graph-based task that involves predicting missing or future links between nodes in a graph.
Check my article that details graph-based tasks like link prediction and node classification.
PyKEEN provides an easy way to perform the evaluation of a link prediction task.
I will create a data frame for train and test sets so that it is easier to understand the evaluation process.
# create a train df
df_train = pd.DataFrame(training.triples)
df_train.columns = ['h','r','t']
# create a test df
df_test = pd.DataFrame(testing.triples)
df_test.columns = ['h','r','t']
I will use the scoring-based evaluation to perform the task of tail prediction (a variant of link prediction). The model has to predict the tail entity from the test set. I consider 3 cases to showcase the working of this evaluation process.
Case 1: Mexico

In the above image, we see that the model is trained on three triples involving Mexico as a head entity. The test set has one instance with the relation neighbor.
Our objective is to correctly predict the tail entity of the test triple, i.e., predict united_states from the triple (mexico, neighbor, united_states).
from pykeen import predict
# tail prediction
predict.predict_target(
    model=result.model,
    head="mexico",
    relation="neighbor",
    triples_factory=result.training,
).df.head(20)
The above code block performs the tail prediction for the "missing" triple (mexico, neighbor, ?).

In the output above, we see that the model ranks belize, guatemala, and united_states among the top predictions based on their scores. belize and guatemala were already present in the training set, and the model correctly predicted united_states. Observe how the top-ranked predictions have better scores.
This is one of the use cases of the link prediction task.
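The ranking step behind scoring-based tail prediction can be sketched in a few lines: score every candidate tail, sort by score, and read off the position of the true tail. The scores below are invented for illustration and do not reproduce the output above. I assume that a higher score means a more plausible triple, and `rank_of` is a hypothetical helper. The optional `known` argument skips tails already seen in training, which is commonly called filtered ranking.

```python
def rank_candidates(scores):
    """Order candidate tails from most to least plausible (higher score is better)."""
    return sorted(scores, key=scores.get, reverse=True)

def rank_of(target, scores, known=()):
    """1-based rank of `target`, optionally ignoring tails already known from training."""
    ranking = [c for c in rank_candidates(scores) if c == target or c not in known]
    return ranking.index(target) + 1

# Made-up scores for the query (mexico, neighbor, ?).
toy_scores = {
    "belize": -7.1,
    "guatemala": -7.4,
    "united_states": -8.0,
    "denmark": -9.3,
    "serbia": -11.2,
}

print(rank_candidates(toy_scores)[:3])
print(rank_of("united_states", toy_scores))
print(rank_of("united_states", toy_scores, known={"belize", "guatemala"}))
```

In this toy setup the true tail ranks third overall, and first once the tails already present in training are set aside.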
Case 2: Senegal

Here, we predict the tail of the test triple (senegal, locatedin, ?). There is only one instance in the test set. The prediction is as follows:

Case 3: Bhutan

Here, there are two instances in the test set. Let’s see the predictions:

The model ranks the first tail (india) near the top but fails to place the second (china) in the top ranks. It still predicts the correct tail label, but with low confidence.
This is likely because the model did not fully capture the network relationships in the embeddings. Remember that I did not perform hyperparameter optimization when training the model, which might have hampered the performance.
Evaluation metrics
In the above section, I showed a basic evaluation process. We have rank-based evaluation methods that quantify the model performance. The rank-based methods have several metrics, and one of the important metrics is mean rank.
Mean rank is the average rank of the correct prediction across all test triples. In the above 3 cases, the mean rank can be calculated as follows:
Case 1: Rank = 4
Case 2: Rank = 2
Case 3: Rank = 2 (first instance)
Case 3: Rank = 13 (second instance)
Hence, the mean rank is (4+2+2+13)/4 = 5.25
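The same arithmetic can be checked in code. The sketch below computes the mean rank from the four ranks listed above. As an addition of mine that the walkthrough does not cover, it also computes two other common rank-based metrics, mean reciprocal rank and hits@k, on the same ranks.

```python
from statistics import mean

# Ranks of the correct tail in the three cases above (Case 3 has two test triples).
ranks = [4, 2, 2, 13]

# Mean rank: lower is better.
mean_rank = mean(ranks)

# Mean reciprocal rank: higher is better and less sensitive to a single bad rank.
mean_reciprocal_rank = mean([1 / r for r in ranks])

def hits_at_k(ranks, k):
    """Fraction of test triples whose correct tail is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

print(mean_rank)
print(round(mean_reciprocal_rank, 3))
print(hits_at_k(ranks, 3))
```

Note how the single rank of 13 pulls the mean rank up noticeably, which is one reason several rank-based metrics are usually reported together.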
The mean rank for the entire test set is as follows:
rank_metrics = result.metric_results.to_df()
rank_metrics[rank_metrics.Metric=='arithmetic_mean_rank']

We performed a tail prediction in the above 3 cases. For the entire test set, the mean rank for tail prediction is around 7.2, which is not too shabby.
We have trained a decent knowledge graph embedding model!
Thanks for reading, and cheers!