Knowledge graph completion with PyKEEN and Neo4j

Integrate PyKEEN library with Neo4j for multi-class link prediction using knowledge graph embedding models

Tomaz Bratanic
Towards Data Science


A couple of weeks ago, I met Francois Vanderseypen, a Graph Data Science consultant. We decided to join forces and start a graph machine learning blog series. This blog post will show how to perform knowledge graph completion, which is essentially multi-class link prediction: instead of just predicting a link, we also try to predict its type.

Knowledge graph completion example. Image by the author.

For knowledge graph completion, the underlying graph should contain multiple types of relationships. If you are dealing with only a single type of relationship, you can use standard link prediction techniques that do not consider the relationship type. The example visualization has only a single node type, but in practice, your input graph can consist of multiple node types as well.

We have to use the knowledge graph embedding models for a multi-class link prediction pipeline instead of plain node embedding models.
What’s the difference, you may ask.
While node embedding models embed only nodes, the knowledge graph embedding models embed both nodes and relationships.

Embedding nodes and relationships via knowledge graph embedding models. Image by author.

The standard syntax for describing a graph pattern is that the starting node is called the head (h), the end or target node is called the tail (t), and the relationship between them is r.

The intuition behind a knowledge graph embedding model such as TransE is that if the relationship is present, the embedding of the head plus the embedding of the relationship lies close to the embedding of the tail, i.e. h + r ≈ t.

Image by the author.

Predictions are then quite simple. For example, if you want to predict new relationships for a specific node, you sum its embedding and the relationship embedding and check whether any node's embedding lies near that sum.
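As a toy illustration of this scoring idea (the four-dimensional embeddings and entity names below are made up, not learned by any model), the TransE-style prediction step might look like this:

```python
import numpy as np

# Made-up 4-dimensional embeddings; in practice these are learned by the model.
node_embeddings = {
    "Aspirin":  np.array([0.1, 0.3, 0.0, 0.2]),
    "Headache": np.array([0.2, 0.5, 0.1, 0.4]),
    "Diabetes": np.array([0.9, 0.1, 0.8, 0.0]),
}
relation_embeddings = {"treats": np.array([0.1, 0.2, 0.1, 0.2])}

def predict_tail(head, relation):
    """Rank candidate tail nodes by their distance to head + relation."""
    target = node_embeddings[head] + relation_embeddings[relation]
    distances = {name: np.linalg.norm(target - emb)
                 for name, emb in node_embeddings.items() if name != head}
    return min(distances, key=distances.get)

print(predict_tail("Aspirin", "treats"))  # → Headache
```

Here Aspirin + treats lands exactly on Headache's embedding, so Headache is ranked as the most likely tail for a new treats link.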

For more detailed information about knowledge graph embedding models, I suggest you check out the following lecture by Jure Leskovec.

Agenda

If you read any of my previous blog posts, you might know that I like to use Neo4j, a native graph database, to store data. You will then use the Neo4j Python driver to fetch the data and transform it into a PyKEEN graph. PyKEEN is a Python library that features knowledge graph embedding models and simplifies multi-class link prediction task executions. Lastly, you will store the predictions back to Neo4j and evaluate the results.

I have prepared a Jupyter notebook that contains all the code in this post.

Prepare the data in Neo4j Desktop

We will be using a subset of the Hetionet dataset. If you want to learn more about the dataset, check out the original paper.

To follow along with this tutorial, I recommend you download the Neo4j Desktop application.

Once you have installed the Neo4j Desktop, you can download the database dump and use it to restore a database instance.

Restore a database dump in Neo4j Desktop. Image by author.

If you need a bit more help with restoring the dump file, I wrote a blog post about it a year ago.

If you have successfully restored the database dump, you can open the Neo4j Browser and execute the following command.

CALL db.schema.visualization()

The procedure should visualize the following graph schema.

Graph model. Image by author.

Our subset of the Hetionet graph contains genes, compounds, and diseases. There are many relationships between them, and you would probably need to be in the biomedical domain to understand them, so I won’t go into details.

In our case, the most important relationship is the treats relationship between compounds and diseases. This blog post will use the knowledge graph embedding models to predict new treats relationships. You could think of this scenario as a drug repurposing task.

PyKEEN

PyKEEN is an incredibly simple-to-use library for knowledge graph completion tasks.
Currently, it features 35 knowledge graph embedding models and even supports out-of-the-box hyper-parameter optimizations.
I like it due to its high-level interface, making it very easy to construct a PyKEEN graph and train an embedding model.
Check out its GitHub repository for more information.

Transform a Neo4j graph to a PyKEEN graph

Now we will move on to the practical part of this post.

First, we will transform the Neo4j graph to the PyKEEN graph and split the train-test data. To begin, we have to define the connection to the Neo4j database.

The run_query function executes a Cypher query and returns the output as a Pandas dataframe. The PyKEEN library has a from_labeled_triples method that takes a list of triples as input and constructs a graph from it.

This example uses a generic Cypher query that can fetch any Neo4j dataset and construct a PyKEEN graph from it. Notice that we use the internal Neo4j ids of nodes to build the triples dataframe. The PyKEEN library expects all triple elements to be strings, so we simply cast the internal ids to strings. Learn more about how to construct the triples and the available parameters in the official documentation.

Now that we have our PyKEEN graph, we can use the split method to perform the train-test data split.

It couldn’t get any easier than this. I must congratulate the PyKEEN authors for developing such a straightforward interface.

Train a knowledge graph embedding model

Now that we have the train-test data available, we can go ahead and train a knowledge graph embedding model. We will use the RotatE model in this example. I am not that familiar with all the variations of the embedding models, but if you want to learn more, I would suggest the lecture by Jure Leskovec I linked above.

We won’t perform any hyper-parameter optimization to keep the tutorial simple. I’ve chosen to use 20 epochs and defined the dimension size to be 512.

P.S. I later learned that 20 epochs probably isn't enough for meaningful training on a large, complex graph, especially with such high dimensionality.

Multi-class link prediction

The PyKEEN library supports multiple methods for multi-class link prediction.
You can find the top K predictions across the whole network, or you can be more specific: define a particular head node and relationship type and evaluate whether any new connections are predicted.

In this example, you will predict new treats relationships for the L-Asparagine compound. Because we used the internal node ids for mapping, we first have to retrieve the node id of L-Asparagine from Neo4j and input it into the prediction method.

Store predictions to Neo4j

For easier evaluation of the results, we will store the top five predictions back to Neo4j.

You can now open the Neo4j Browser and run the following Cypher statement to inspect the results.

MATCH p=(:Compound)-[:PREDICTED_TREATS]->(d:Disease)
RETURN p

Results

Predicted treats relationship between L-Asparagine and top five diseases. Image by the author.

As I am not a medical doctor, I can’t say if the predictions make sense or not. In the biomedical domain, link prediction is part of the scientific process of generating hypotheses and not blindly believing the results.

Explaining predictions

As far as I know, knowledge graph embedding models are not that useful for explaining predictions. On the other hand, you can use the existing connections in the graph to present the supporting evidence to a medical doctor and let them decide whether the predictions make sense.

For example, you could investigate direct and indirect paths between L-Asparagine and colon cancer with the following Cypher query.

MATCH (c:Compound {name: "L-Asparagine"}),(d:Disease {name:"colon cancer"})
WITH c,d
MATCH p=allShortestPaths((c)-[r:binds|regulates|interacts|upregulates|downregulates|associates*1..4]-(d))
RETURN p LIMIT 25

Results

Indirect paths between L-Asparagine and colon cancer. Image by the author

On the left side, we have the colon cancer node, and on the right side the L-Asparagine node. In the middle of the visualization are the genes that connect the two.

Out of curiosity, I’ve googled L-Asparagine in combination with colon cancer and came across this article from 2019.

While my layman's eyes can't really tell whether asparagine should be increased or decreased to help with the disease, there at least appears to be a relationship between the two.

Conclusion

Most of the time, you deal with graphs with multiple relationship types. Therefore, knowledge graph embedding models are handy for multi-class link prediction tasks, where you want to predict a new link and its type. For example, there is a big difference if the predicted link type is treats or causes.

The transformation from Neo4j to PyKEEN graph is generic and will work on any dataset. So I encourage you to try it out and give me some feedback on which use-cases you found interesting.

As always, the code is available on GitHub.

References

  • Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Himmelstein, Daniel Scott et al. “Systematic integration of biomedical knowledge prioritizes drugs for repurposing.” eLife vol. 6 e26726. 22 Sep. 2017, doi:10.7554/eLife.26726
  • Ali, M., Berrendorf, M., Hoyt, C., Vermue, L., Galkin, M., Sharifzadeh, S., Fischer, A., Tresp, V., & Lehmann, J. (2020). Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework. arXiv preprint arXiv:2006.13365.
  • Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, & Jian Tang. (2019). RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.
  • Du, Feng et al. “SOX12 promotes colorectal cancer cell proliferation and metastasis by regulating asparagine synthesis.” Cell death & disease vol. 10,3 239. 11 Mar. 2019, doi:10.1038/s41419-019-1481-9


Data explorer. Turn everything into a graph. Author of Graph algorithms for Data Science at Manning publication. http://mng.bz/GGVN