Analyzing the Evolution of Life on Earth with Neo4j

Explore the NCBI taxonomy of organisms in a graph database

Published in Towards Data Science · 8 min read · Jun 23, 2022

The evolution of life is a beautiful and insightful field of study that traces our origins back to the beginning of life. It helps us understand where we came from and where we are potentially going. The relationships between species are often depicted in the tree of life, a model that describes how species are related to one another. Since a tree is a special kind of graph, it makes sense to store those relationships in a graph database, where they can be analyzed and visualized.

In this blog post, I have decided to import the NCBI taxonomy of organisms into Neo4j, a graph database, where we can easily traverse and analyze relationships between various species.

Environment and dataset setup

To follow the code examples in this post, you will need to download the Neo4j Desktop application. I have prepared a database dump that you can use to get the Neo4j database up and running without having to import the dataset yourself. Take a look at my previous blog post if you need some help with restoring the database dump.

ncbi-taxonomy-neo4j.dump

drive.google.com

The original dataset is available on the NCBI website.

Index of /pub/taxonomy

ftp.ncbi.nlm.nih.gov

I have used the new taxonomy dump folder downloaded on 13th June 2022 to create the above database dump. While no explicit license is specified for the dataset, the NCBI website states that all information is available within the public domain.

I have made available the code used to import the taxonomy into Neo4j on my GitHub if you want to evaluate the process or make any changes.

Graph schema

I have imported the following files into Neo4j:

nodes.dmp
names.dmp
host.dmp
citations.dmp

Some of the other files contain redundant information that is already present in the nodes.dmp file, which holds the taxonomy of organisms. I have looked a bit at the genetic code files, but since I have no idea what to do with genetic code names and their translations, I have skipped them during import.

Using the above four files, I have constructed the following graph schema.

Taxonomy graph schema. Image by the author.

I have added a generic label Node to all nodes present in the nodes.dmp file. The nodes with the generic label Node contain multiple properties that can be used to import other files and help experts better analyze the dataset. For us, only the name property will be relevant. The taxonomy hierarchy is represented with the PARENT relationship between nodes. The dataset also contains a file that describes potential hosts of various species, which are represented with the POTENTIAL_HOST relationship. Lastly, some of the nodes are mentioned in various medical sources, which are represented as the Citation nodes.

All the nodes with the generic label Node have a secondary label that describes their rank. Some examples of ranks are Species, Family, and Genus. There are too many of them to list them all, so I have prepared a screenshot with all available node labels.

All node labels available in the database. Image by the author.

Exploratory analysis

All the code in this analysis is available on GitHub in the form of a Jupyter notebook, although the queries there have been modified to return pandas DataFrames instead of feeding visualization tools.

I looked for the Homo sapiens species in the dataset but couldn’t find it under that name. Interestingly, the node for our species is simply named human. We can examine the taxonomy neighborhood up to four hops with the following Cypher statement:

MATCH p=(n:Node {name:"human"})-[:PARENT*..4]-()
RETURN p

Results

Taxonomy neighborhood of human species. Image by the author.

I am making the visualizations in Neo4j Bloom as it offers a hierarchical layout, which is perfect for visualizing taxonomies. One of the advantages of using Neo4j Bloom is that it allows users who are not experienced with Neo4j or Cypher to inspect and analyze graphs. Follow this link if you want to learn more about Neo4j Bloom.

So, the human node is a species that belongs to the humans genus, which is part of the Pongidae family. After a quick Google search, it seems that the Pongidae taxon is obsolete and that Hominidae should be used instead, which is represented in the NCBI taxonomy as a superfamily. Interestingly, the human species has two subspecies, namely Neanderthals and Denisovans, which are represented under the homo sp altai node. I just learned something new about our history.

The NCBI taxonomy dataset contains only about 10% of the described species of life on the planet, so don’t be surprised if some species are missing from the dataset.

Let’s examine how many species there are in the dataset with the following Cypher statement:

MATCH (s:Species)
RETURN count(s) AS speciesCount

There are almost two million species described in the dataset, which means there is plenty of room to explore.

Next, we can examine the taxonomy hierarchy for human species all the way to the root of the tree using a simple query:

MATCH (:Node {name:'human'})-[:PARENT*0..]->(parent)
RETURN parent.name AS lineage, labels(parent)[1] AS rank

Result

Taxonomy hierarchy for human species. Image by the author.

It seems that there are 31 traversals needed to get from the human node to the root node. For some reason, the root node has a self-loop (relationship with itself), and that’s why it shows twice in the results. In addition, a clade, a group of organisms that have evolved from a common ancestor, shows up multiple times in the hierarchy. It looks like the NCBI taxonomy is richer than what you would find with a quick Google search.
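To make the walk to the root concrete outside the database, here is a minimal Python sketch of the same idea. The parent map is a hypothetical, heavily abbreviated stand-in for the PARENT relationships (it is not the real 31-hop lineage), and the function name is my own. The only property taken from the dataset is that the root points to itself, so the walk stops as soon as a node is its own parent.

```python
# Toy parent pointers (hypothetical and abbreviated, not the real NCBI lineage).
PARENT = {
    "human": "Homo",
    "Homo": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "root",
    "root": "root",  # the root has a self-loop, as observed in the dataset
}


def lineage(name, parent=PARENT):
    """Follow parent pointers upward and stop at the self-looping root."""
    path = [name]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


if __name__ == "__main__":
    print(lineage("human"))
```

Because the loop checks for the self-loop explicitly, the root is listed once here, whereas the Cypher query traverses the self-loop relationship and therefore returns the root twice.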

Graph databases like Neo4j are also great at finding the shortest paths between nodes in the graph. Now, we can answer the critical question of how close bananas are to oranges in the taxonomy.

MATCH (h:Node {name:'Valencia orange'}), (g:Node {name:'sweet banana'})
MATCH p=shortestPath( (h)-[:PARENT*]-(g))
RETURN p

Results

Shortest path between banana and orange. Image by the author.

It seems that the closest common ancestor of the sweet banana and the Valencia orange is the Mesangiospermae clade. Mesangiospermae is a clade of flowering plants.
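In a tree, the shortest path between two nodes always climbs from each node to their closest common ancestor. The sketch below illustrates that with plain Python on a toy parent map. The intermediate taxa are a hypothetical abbreviation of the real lineages and the function names are mine; this is not how Neo4j implements shortestPath, only the same result computed by hand.

```python
# Toy parent pointers (hypothetical and abbreviated; most real ranks are omitted).
PARENT = {
    "sweet banana": "Musa",
    "Musa": "Musaceae",
    "Musaceae": "Mesangiospermae",
    "Valencia orange": "Citrus",
    "Citrus": "Rutaceae",
    "Rutaceae": "Mesangiospermae",
    "Mesangiospermae": "root",
    "root": "root",  # self-looping root
}


def ancestors(name, parent):
    """Return the node followed by all of its ancestors up to the root."""
    path = [name]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def common_ancestor_path(a, b, parent):
    """Return (closest common ancestor, path from a to b through it)."""
    up_a = ancestors(a, parent)
    up_b = ancestors(b, parent)
    seen = set(up_a)
    # The first ancestor of b that is also an ancestor of a is the meeting point.
    meet = next(node for node in up_b if node in seen)
    left = up_a[: up_a.index(meet) + 1]
    right = up_b[: up_b.index(meet)]
    return meet, left + right[::-1]


if __name__ == "__main__":
    meet, path = common_ancestor_path("sweet banana", "Valencia orange", PARENT)
    print(meet)
    print(" -> ".join(path))
```

The number of hops between the two species is simply the length of the returned path minus one.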

Another use case for traversing relationships could be finding all the taxa in the same family as a particular species. Here, we will visualize all the genera in the same family as the sweet banana.

MATCH (:Node {name:'sweet banana'})-[:PARENT*0..]->(f:Family)
MATCH p=(f)<-[:PARENT*]-(s:Genus)
RETURN p

Results

Genera present in the same family as sweet banana. Image by the author.

Sweet banana belongs to the Musa genus and Musaceae family. Interestingly, there is a Musella genus, which sounds like a small Musa. In fact, after googling the Musella genus, it looks like only a single species is present in the Musella genus. The species is commonly referred to as the Chinese dwarf banana.
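The query works in two steps: climb from the starting node until a Family node is reached, then descend from that family and keep every Genus node found along the way. A minimal Python sketch of those two steps follows, assuming a toy table of (rank, parent) pairs. The table, the rank given to the sweet banana node, and the function names are hypothetical.

```python
from collections import defaultdict

# name -> (rank, parent); toy data, with the order standing in for the root.
TAXA = {
    "Zingiberales": ("Order", "Zingiberales"),
    "Musaceae": ("Family", "Zingiberales"),
    "Musa": ("Genus", "Musaceae"),
    "Musella": ("Genus", "Musaceae"),
    "sweet banana": ("Species", "Musa"),
}


def family_of(name, taxa):
    """Climb parent pointers (zero or more hops) until a Family is found."""
    while taxa[name][0] != "Family":
        parent = taxa[name][1]
        if parent == name:
            raise ValueError("reached the root without finding a Family")
        name = parent
    return name


def genera_in_family(family, taxa):
    """Collect every Genus below the given family, at any depth."""
    children = defaultdict(list)
    for child, (_, parent) in taxa.items():
        if child != parent:  # skip the self-loop on the root
            children[parent].append(child)
    found, stack = [], [family]
    while stack:
        for child in children[stack.pop()]:
            if taxa[child][0] == "Genus":
                found.append(child)
            stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    family = family_of("sweet banana", TAXA)
    print(family, genera_in_family(family, TAXA))
```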

Inference with Neo4j

In the last example, we will look at how to develop inference queries in Neo4j. Inference means we create new relationships based on a set of rules between nodes and either store them in the database or use them at query-time only. Here, I will show you an example of inference queries using new relationships only at query-time when analyzing potential hosts.

First, we will evaluate which organisms have the most described potential parasites in the dataset.

MATCH (n:Node)
RETURN n.name AS organism,
       labels(n)[1] AS rank,
       size((n)<-[:POTENTIAL_HOST]-()) AS potentialParasites
ORDER BY potentialParasites DESC
LIMIT 5

Results

It seems that humans have the most described potential parasites and are the only species among the top hosts. I would venture a guess that most, if not all, of the potential parasites for humans are also potential parasites for vertebrates, since the counts are so close.

We can check how many potential hosts organisms have with the following Cypher statement.

MATCH (n:Node)
WHERE EXISTS { (n)-[:POTENTIAL_HOST]->()}
WITH size((n)-[:POTENTIAL_HOST]->()) AS ph
RETURN ph, count(*) AS count
ORDER BY ph

Results

18,359 organisms have only one known host, while 163,434 have two known hosts. This is consistent with my hypothesis that most parasites that attack humans also potentially attack all vertebrates, although the counts alone do not show which two hosts are involved.
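The aggregation in the query is a histogram of a histogram: first count the hosts per organism, then count how many organisms share each host count. Here is a small Python sketch of that two-level count on made-up (parasite, host) pairs. The pairs and the function name are hypothetical and only illustrate the shape of the computation, not the real numbers.

```python
from collections import Counter

# Hypothetical (parasite, host) pairs standing in for POTENTIAL_HOST relationships.
POTENTIAL_HOST = [
    ("Monkeypox virus", "human"),
    ("Monkeypox virus", "vertebrates"),
    ("virus A", "human"),
    ("virus A", "vertebrates"),
    ("virus B", "plants"),
]


def host_count_histogram(edges):
    """Map 'number of hosts' to 'number of organisms with that many hosts'."""
    hosts_per_organism = Counter(parasite for parasite, _ in edges)
    histogram = Counter(hosts_per_organism.values())
    return dict(sorted(histogram.items()))


if __name__ == "__main__":
    for hosts, organisms in host_count_histogram(POTENTIAL_HOST).items():
        print(hosts, organisms)
```

Organisms without any host never appear in the edge list, which mirrors the WHERE EXISTS filter in the Cypher statement.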

Here is where the inference queries come into play. We know that vertebrates are a higher-level taxon in the taxonomy of organisms. Therefore, we can traverse from vertebrates down to the species level to examine which species could potentially be used as hosts.

We will use the example of the Monkeypox virus, as it is relevant at the time of writing. First, we can evaluate its potential hosts.

MATCH (n:Node {name:"Monkeypox virus"})-[:POTENTIAL_HOST]->(host)
RETURN host.name AS host

Results

Notice that both human and vertebrates are described as potential hosts of Monkeypox virus. However, let’s say we want to examine all the species that are potentially endangered by the virus.

MATCH (n:Node {name:"Monkeypox virus"})-[:POTENTIAL_HOST]->()<-[:PARENT*0..]-(host:Species)
RETURN host.name AS host
LIMIT 10

Results

We have used a limit as there are a lot of vertebrates. Unfortunately, we don’t know which of them are extinct as that would help us filter them out and identify only potential victims of the Monkeypox virus that are still alive. However, it is still an excellent example of inference in Neo4j, where we create or infer a new relationship based on the predefined set of rules at query time.
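The inference rule itself is short: if a taxon is a potential host, then every species at or below that taxon is treated as a potential host too. The following Python sketch applies that rule to toy data. The taxa, ranks, host list, and function name are all hypothetical and serve only to show the traversal.

```python
from collections import defaultdict

# name -> (rank, parent); hypothetical, heavily abbreviated taxonomy.
TAXA = {
    "root": ("NoRank", "root"),
    "vertebrates": ("Clade", "root"),
    "Primates": ("Order", "vertebrates"),
    "human": ("Species", "Primates"),
    "house mouse": ("Species", "vertebrates"),
    "sweet banana": ("Species", "root"),
}

# Explicit POTENTIAL_HOST relationships (toy data).
POTENTIAL_HOST = {"Monkeypox virus": ["human", "vertebrates"]}


def inferred_host_species(virus, hosts, taxa):
    """Expand each explicit host to all species at or below it."""
    children = defaultdict(list)
    for child, (_, parent) in taxa.items():
        if child != parent:  # skip the self-loop on the root
            children[parent].append(child)
    species, stack = set(), list(hosts[virus])
    while stack:
        node = stack.pop()
        if taxa[node][0] == "Species":
            species.add(node)  # zero hops: the host itself may be a species
        stack.extend(children[node])
    return sorted(species)


if __name__ == "__main__":
    print(inferred_host_species("Monkeypox virus", POTENTIAL_HOST, TAXA))
```

A set is used because a species such as human can be reached both directly and through vertebrates; the Cypher statement would need DISTINCT to remove such duplicates.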

Conclusion

I really enjoyed writing this article as it gave me an opportunity to explore the taxonomy of bananas and oranges. You can use this dataset as a hobbyist to explore your favourite species or even in a more professional environment. Simply download the database dump, load it into Neo4j, and get started.

The code is available on GitHub.

Written by Tomaz Bratanic