Modeling Biomedical Data for a Drug Discovery Knowledge Graph

A debrief from AstraZeneca’s Data Science and Artificial Intelligence department

Daniel Crowe
8 min read · Oct 6, 2020


Earlier this month, we were joined by Natalie Kurbatova, Associate Principal Scientist at AstraZeneca, for the first session of Orbit.

Natalie works in AstraZeneca’s Data Science and Artificial Intelligence department, where she focuses on data modeling, integration of data into a knowledge graph, prediction algorithms, and the topics therein.

The Goal: Predict novel disease targets

At AstraZeneca, Natalie’s team focuses on building a Knowledge Graph to predict new disease targets (gene or protein targets), which they call a Discovery Graph. In building this, Natalie walked us through the two types of data sources to consider:

Structured data refers to the publicly available datasets in bioinformatics that have already been curated and used extensively in the industry. While biomedical structured data is machine readable, it is often not straightforward to work with. In particular, it’s difficult to integrate these datasets because they can describe similar concepts in different ways, for example by using distinct IDs that do not align with each other. Some of the most widely used publicly available datasets include: Ensembl, Uniprot, ChEMBL, PubChem, OmniPath, Reactome, GO, CTD, HumanProteinAtlas.

Unstructured data refers to data extracted from text. To process it, we need NLP (Natural Language Processing) pipelines, and we then have to work with their output. Here, the difficulty is that this data is often messy and noisy. For their NLP engine, Natalie’s team used the open source SciBERT model as well as AstraZeneca’s proprietary tools.

Natalie then introduced us to the schema that her team built for their Discovery Graph (slide below).

The visualised schema in TypeDB Studio — focusing on defined entities. Slide(s) reposted with permission.

Natalie’s team is mainly interested in studying compounds, diseases, and genes/proteins, which they affectionately call the Golden Triangle. The connections between these entities need to be as solid and reliable as possible, which means ingesting all possible related data sources into their Discovery Graph.

This Discovery Graph is growing in size daily. As of today, it is already populated with these three entity types:

  • gene-protein: 19,371
  • disease: 11,565
  • compound: 14,294

There are also 656,206 relations between the entity types.

How are they modelling this “Golden Triangle”?

Natalie then explained how she modelled each part of the Discovery Graph, giving examples along the way.

Modelling Genes and Proteins

First, her team looked at how to model genes and proteins. Instead of separating these into two entities, Natalie’s team decided to model them as a single entity, which they called gene-protein. This helps to reduce noise and bias.

The [gene-protein] entity visualised in TypeDB Studio. Slide(s) reposted with permission.

The associated attributes gene-name and chromosome hold the gene’s name and the chromosome on which that gene is located. The gene-id attribute is modelled as an abstract attribute whose children contain the unique IDs for each gene-protein.

💡 There may be scenarios where a parent attribute is defined only for other attributes to inherit from, and under no circumstances do we expect to have any instances of this parent. To model this logic in the schema, use the abstract keyword.

The interaction relation connects gene-protein entities, where one plays the role of gene-source and the other gene-target. This relation is particularly important in predicting novel disease targets, as it captures regulatory interactions between genes and proteins.
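To make this concrete, here is a minimal TypeQL sketch of how such a gene-protein entity and interaction relation could be defined. The value types, the ensembl-gene-id subtype and the exact spellings are illustrative assumptions rather than the team’s actual schema:

```
define

gene-name sub attribute, value string;
chromosome sub attribute, value string;
# Abstract parent ID: only its children are ever instantiated
gene-id sub attribute, abstract, value string;
ensembl-gene-id sub gene-id;

gene-protein sub entity,
    owns gene-name,
    owns chromosome,
    owns ensembl-gene-id,
    plays interaction:gene-source,
    plays interaction:gene-target;

# Regulatory interaction between two gene-proteins
interaction sub relation,
    relates gene-source,
    relates gene-target;
```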

Modelling Compounds

Instead of separating them, Natalie’s team modelled small molecules and antibodies together as one entity type they called compound. For this data, they used two data sources: PubChem and ChEMBL.

These databases overlap by roughly 95%, but some chemicals exist in only one of the two sources. To deal with these unique chemicals, they decided to assign the chembl-id as the primary ID if a compound had one; if it didn’t, they would use the pubchem-id instead.

The [compound] entity visualised in TypeDB Studio. Slide(s) reposted with permission.

As we could see on Natalie’s slide, compound is modelled with a few more attributes. The attribute compound-id is used as an abstract attribute with children preferred-compound-id, pubchem-id, chembl-id and drugbank.
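A hedged TypeQL sketch of how this part of the schema might look; the attribute names follow the slide as described above, while the value types are assumptions:

```
define

# Abstract parent ID with one child per source, plus the preferred ID
compound-id sub attribute, abstract, value string;
preferred-compound-id sub compound-id;
pubchem-id sub compound-id;
chembl-id sub compound-id;
drugbank sub compound-id;

compound sub entity,
    owns preferred-compound-id,
    owns pubchem-id,
    owns chembl-id,
    owns drugbank;
```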

To obtain the value for the preferred-compound-id attribute, they use two rules to assign either the chembl-id or pubchem-id according to the logic below:

Rules used to assign the appropriate (ID) to a particular [compound]. Slide(s) reposted with permission.

The first rule attaches the value of chembl-id, if it exists, to preferred-compound-id. The second rule first checks whether a compound lacks a chembl-id and, if so, attaches the value of the pubchem-id attribute to preferred-compound-id instead. This makes it easy to query for a single, consistent ID when asking specifically for compounds.
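A sketch of what these two rules could look like in TypeQL. The rule names and the exact conclusion syntax for copying a value from one attribute type to another are assumptions based on the logic described above, and may differ from the rules shown on the slide:

```
define

# If a compound has a chembl-id, use it as the preferred ID
rule preferred-id-from-chembl:
when {
    $c isa compound, has chembl-id $chembl;
} then {
    $c has preferred-compound-id $chembl;
};

# Otherwise fall back to the pubchem-id
rule preferred-id-from-pubchem:
when {
    $c isa compound, has pubchem-id $pubchem;
    not { $c has chembl-id $any; };
} then {
    $c has preferred-compound-id $pubchem;
};
```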

Modelling Diseases

Natalie explained that the most complex concept to model was disease. This is because diseases are described by multiple ontologies, each assigning its own IDs: a single disease might have several IDs from different ontologies and disease hierarchies. In the slide below, Natalie showed us the hierarchies of three data sources: EFO, MONDO and DOID.

Ontological hierarchies of disease data sources. Slide(s) reposted with permission.

The reason for this heterogeneity is that these ontologies were originally designed in different sub-domains: for example, medical doctors use Orphanet, while a medical research group might use another, such as EFO or MONDO. This leads to unconnected and disparate data. Natalie and her team want to be able to model a disease entity that cross-references these ontologies.

Where did the data come from? Slide(s) reposted with permission.

Because of this challenge, they chose to model two entity types: disease and ontology. They use a disease-hierarchy relation — a ternary relation — to connect the two.

The [disease] entity visualised in TypeDB Studio. Slide(s) reposted with permission.
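A simplified TypeQL sketch of this part of the model. The role name ontology-source, the disease-name attribute and the specific ID subtypes are illustrative assumptions:

```
define

disease-name sub attribute, value string;
ontology-name sub attribute, value string;
# One abstract ID parent with a child per ontology
disease-id sub attribute, abstract, value string;
efo-id sub disease-id;
mondo-id sub disease-id;
doid-id sub disease-id;
mesh-id sub disease-id;

disease sub entity,
    owns disease-name,
    owns efo-id, owns mondo-id, owns doid-id, owns mesh-id,
    plays disease-hierarchy:superior-disease,
    plays disease-hierarchy:subordinate-disease;

ontology sub entity,
    owns ontology-name,
    plays disease-hierarchy:ontology-source;

# Ternary relation: parent disease, child disease, and the ontology
# in which that parent-child edge is asserted
disease-hierarchy sub relation,
    relates superior-disease,
    relates subordinate-disease,
    relates ontology-source;
```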

With this disease-hierarchy relation, Natalie and her team are able to write useful queries such as:

Give me all the children of the “chronic kidney disease” disease node, using the EFO ontology.

Example query and expected path through hierarchies. Slide(s) reposted with permission.
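Under the sketch schema above, that question might translate into a TypeQL query along these lines (the disease-name and ontology-name values are assumptions):

```
match
    $parent isa disease, has disease-name "chronic kidney disease";
    $efo isa ontology, has ontology-name "EFO";
    (superior-disease: $parent, subordinate-disease: $child, ontology-source: $efo)
        isa disease-hierarchy;
    $child has disease-name $name;
get $child, $name;
```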

This query shows the power of the model, and of using TypeDB’s schema, because even though they are just asking for a high-level disease, the query will return all possible sub-types of that disease.

To model this, Natalie leverages TypeDB’s rule engine again:

Example of a transitivity rule in TypeDB. Slide(s) reposted with permission.

The rule shown on the slide above, single-ontology-transitivity, makes the disease-hierarchy relation transitive. The result of this rule is that all diseases playing the role of subordinate-disease beneath a given disease are inferred when you query for that disease in the role of superior-disease. Further, this means that even if you don’t specify the ontology that you are querying from, you are still returned all subordinate diseases of that particular parent disease, across all ingested data sources.
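A hedged reconstruction of such a transitivity rule, reusing the sketch schema from above; the actual rule on the slide may bind the ontology differently:

```
define

# If A is above B and B is above C within the same ontology,
# then infer that A is above C in that ontology as well.
rule single-ontology-transitivity:
when {
    (superior-disease: $a, subordinate-disease: $b, ontology-source: $o) isa disease-hierarchy;
    (superior-disease: $b, subordinate-disease: $c, ontology-source: $o) isa disease-hierarchy;
} then {
    (superior-disease: $a, subordinate-disease: $c, ontology-source: $o) isa disease-hierarchy;
};
```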

This type of rule is especially useful when there is no corresponding reference ID in a particular ontology.

Mapping disease entity to appropriate [disease-id]. Slide(s) reposted with permission.

As Natalie showed us, when a disease-id is not present in a certain ontology, we can go up the hierarchy until we find an ID that does exist and then assign that ID. To do this, Natalie uses another transitive rule:

When an (id) doesn’t exist, this transitivity rule pulls down from the parent. Slide(s) reposted with permission.

The first rule, determine-best-mesh-id-1, assigns a best-mesh-id attribute to a disease entity if a mesh-id is already present. The second rule then states that if we don’t know the mesh-id, we want to pull down the mesh-id from a parent disease. Natalie emphasised how effective this was, and how positive the results have been in practice.
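A sketch of how this pair of rules might be written, again under the assumed schema; the second rule’s name and the conclusion syntax for copying an attribute value are assumptions:

```
define

best-mesh-id sub attribute, value string;
disease sub entity, owns best-mesh-id;

# Rule 1: if the disease already has a mesh-id, that is its best-mesh-id
rule determine-best-mesh-id-1:
when {
    $d isa disease, has mesh-id $mesh;
} then {
    $d has best-mesh-id $mesh;
};

# Rule 2: otherwise, pull the best-mesh-id down from a parent disease
rule determine-best-mesh-id-2:
when {
    $d isa disease;
    not { $d has mesh-id $any; };
    (superior-disease: $parent, subordinate-disease: $d) isa disease-hierarchy;
    $parent has best-mesh-id $mesh;
} then {
    $d has best-mesh-id $mesh;
};
```

In this sketch, because the second rule can build on best-mesh-id values that were themselves inferred, the ID propagates down the hierarchy until each disease carries the closest available MeSH identifier.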

Data Integration

Once the domain has been modelled, we can start ingesting our data. There are two approaches that can be taken:

  • Data factory: integrate data before loading into a knowledge graph
  • Data schema: data integration happens in the knowledge graph through flexible data loading and subsequent reasoning

Natalie’s research group uses the Data Schema approach, using the knowledge graph itself as a tool to integrate the data. This may be counter-intuitive to some developers.

If you imagine adding further data sources that were unknown at the start, you have to be flexible enough to manipulate the data inside the database. This flexibility also enables the replication of another group’s work, for example from a paper, in order to validate its hypotheses.

In the case of research work, flexibility is essential, as Natalie showed us with the previously conflicting mesh-ids.

The solution, as noted, is using TypeDB’s reasoning engine.

Summary

In closing, Natalie spent a few minutes summarising the benefits of this approach, in relation to the data schema and logical reasoning.

Other graph databases are flexible in data loading, but lack validation. And while flexibility is important, validation is necessary to keep the data consistent.

A few months ago, one of Natalie’s junior colleagues accidentally loaded movie data into their biomedical knowledge graph. The next morning, the rest of the team was surprised to see films and actors among their biomedical data!

With TypeDB, data is logically validated against the data schema at insert time, to ensure the right data goes into the knowledge graph.
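For example, with a schema like the sketches above in place, a stray write that references types outside the model is simply rejected; the movie and title types below are hypothetical and do not exist in the biomedical schema, so the insert fails:

```
insert $m isa movie, has title "Heat";
# Fails: 'movie' and 'title' are not defined in the schema
```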

Natalie believes this database provides a good balance between formal schema design, logical reasoning and prediction algorithm capabilities. In her experience, data needs to be modelled before it is loaded into the knowledge graph, and a formal yet flexible schema helps to find this ideal balance.

💡 Note that all prediction algorithms depend on what you load into the database; in other words, be careful what you put in the knowledge graph. The noise level should be as low as possible.

And finally, she spoke about the key choice of whether to integrate data before or after loading it into the database, which in their case (as explained above) is done after loading, once the data has been inserted.

Special thank you to Natalie and her team at AstraZeneca for their enthusiasm, dedication and thorough explanation.

You can find the full presentation on Vaticle’s YouTube channel here.
