
Language Models are Open Knowledge Graphs … but are hard to mine!

Join me as I dive into the latest research on creating knowledge graphs using transformer-based language models.

Getting Started

In this article, we will explore the latest research on building knowledge graphs from text by leveraging transformer-based language models. The paper we will look at is called "Language Models are Open Knowledge Graphs", where the authors claim that the "paper shows how to construct knowledge graphs (KGs) from pre-trained language models (e.g., BERT, GPT-2/3), without human supervision."

The last part, which claims to have removed humans from the process, got me really excited. Having said that, I have learnt to take such claims with a grain of salt, so I decided to explore this method further by experimenting with various components of the pipeline proposed in the paper. The idea is to understand the impact of each step in the pipeline on a domain-specific example, and to gauge whether any supervision is needed from either domain experts or data scientists.

Note: I assume some familiarity with the concept of knowledge graphs and transformer-based language models. Briefly, as stated here, "The knowledge graph (KG) represents a collection of interlinked descriptions of entities — real-world objects and events, or abstract concepts (e.g., documents)". It allows us to model the real world as a graph, which can then be reasoned over to assert facts about the world. To learn more about transformer-based models, refer to Jay Alammar’s posts here and here. You can also refer to my post "Beyond Classification With Transformers and Hugging Face", where I dive into these models, especially the attention mechanisms, which is where most of the "knowledge" is stored.

Without further ado, let’s dive in! 🙂

Most leading companies have begun using Knowledge Graphs (KGs) to organize, manage, and expand knowledge about their businesses. While most businesses get quite far with populating KGs with instances from various structured data sources, they struggle to do the same from unstructured sources like text. Extracting and building knowledge from text using Machine Learning (ML) and Natural Language Processing (NLP) techniques has been an area of focus for many leading universities and researchers. Let’s begin with a bird’s-eye view of the approach generally followed by most researchers, shown below. The process flow is pretty standard across the industry and you can find many publications and talks on it; I found one talk (slightly dated, but very simple and concise) here: Talk

As seen in the figure above, the process starts by parsing and enriching text with metadata such as POS tags, named entities, dependency parse trees, etc. The next step is to identify mentions in the text (i.e., portions of the text) that can most likely qualify as entities in a knowledge graph; this is called mention detection. The mentions are then passed through an entity linker, which finds the most likely candidate entities from the target KG and then disambiguates by running logic on top (this could be ML, lookup dictionaries, string matching, etc.) to map each mention to a particular node in the target KG. The final step is to derive carefully designed linguistic features from the enrichments in step 1 and the linked entities, and then use ML, first-order logic, or hand-written rules to infer the most likely relations. The disambiguated entities and relations are then produced as output triplets (e.g., subject, predicate, object) that can be linked into a graph.
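As a concrete illustration of the enrichment step, here is a minimal spaCy sketch (assuming the en_core_web_md model is installed) that produces the POS tags, dependency parse, named entities, and noun chunks mentioned above:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc = nlp("De Niro was diagnosed with prostate cancer in 2003.")

# Token-level enrichments: part-of-speech tags and dependency arcs
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities and noun chunks are typical candidates for mention detection
print([(ent.text, ent.label_) for ent in doc.ents])
print([chunk.text for chunk in doc.noun_chunks])
```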

The biggest pain point with most of these approaches is the time and effort they demand from technical specialists such as data scientists, linguists, and knowledge base developers. They need to work closely with domain experts to come up with linguistic features based on the enrichments and to define domain-specific dictionaries and/or first-order logic, which can take weeks or months depending on size and complexity, and may still end up missing information due to the amount of variation in the domain.

Leveraging Language Models for Knowledge Graph Construction

More recently, the research community has started exploring how to leverage deep learning to build the linguistic features that were traditionally built by humans. Let’s look at one such paper, "Language Models are Open Knowledge Graphs" by Chenguang Wang, Xiao Liu, and Dawn Song. There are a few resources online that explain this paper in more detail, and the paper itself is an easy read. I will go through the paper at a high level, and then turn our focus to experimenting with their proposed pipeline.

The main idea behind this paper is to minimize the involvement of humans in the process of creating knowledge graphs from textual data. The authors hypothesize that transformer-based models like BERT or GPT-2 have the capacity to learn and store knowledge about the domain, which can be converted into a structured knowledge graph. They go on to test this on the TAC Knowledge Base Population (KBP) dataset and Wikidata, and publish encouraging results on KG population using the proposed method.

The proposed approach has two steps: Match and Map (MaMa). Let’s look at how these modify the pipeline we discussed above:

Where this approach differs from previous methods is in the use of attention mechanisms to infer candidate triplets from text. The attention weights, which the pre-trained transformer-based models have learnt during training, provide insights into the relationships between terms in the text. These weights, applied to anchor terms (noun chunks) obtained from a standard library like spaCy, yield possible candidate facts (head, relation, tail), which can then be disambiguated using an off-the-shelf entity linker like REL.
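To make this concrete, here is a hedged sketch (not the paper’s exact code) of the raw ingredients of the Match step: noun-chunk anchors from spaCy and attention matrices from a pre-trained BERT, obtained via the Hugging Face transformers library. The paper’s beam search would aggregate these attention weights along token paths between anchors to score candidate facts.

```python
import spacy
import torch
from transformers import AutoModel, AutoTokenizer

nlp = spacy.load("en_core_web_md")
tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModel.from_pretrained("bert-large-cased", output_attentions=True)

text = "De Niro was diagnosed with prostate cancer in 2003."
anchors = [chunk.text for chunk in nlp(text).noun_chunks]  # candidate heads/tails

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attention = outputs.attentions[-1][0].mean(dim=0)  # average heads in the last layer
print(anchors)
print(attention.shape)  # token-to-token attention matrix
```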

Following is a diagram taken directly from the paper. It shows how the match process works by running a beam search to find the triples with the highest aggregate score.

Let’s take this a step further and create a granular version of the proposed pipeline based on the code provided here (the paper does not spell out some of the smaller details).

As we see from the flow above, the pipeline involves a lot more than just running the sentences through an LM. While an LM may help us generate candidate facts, it is equally important to tune the other parts of the pipeline. All the other pieces of the pipeline try to leverage pre-trained models from libraries like spaCy, REL and Flair.


The Experiment

Now that we know what the authors of the paper are attempting to achieve, let’s start experimenting with their pipeline, and see if we can achieve the intended result on a couple of small samples.

We will pick one example that has an overlap between the general domain and healthcare:

The second example is healthcare specific:

We will start the experiment by using their default implementation, and then move onto tweaking different steps in the pipeline, visualizing the output as we proceed.

The original code from the paper can be found here: Original Code

I forked the repo and made some changes, namely: a wrapper to run and visualize the output using Dash, support for running the scispaCy Named Entity Recognizer (NER) and Entity Linker, and the ability to run different language models supported through the Hugging Face transformers library. Here is the link to my forked repo (still in a branch as I add more things to experiment with; it will eventually be merged into main).

1. Out-of-the-box:

  • Noun chunks for anchor terms, found using the en_core_web_md model from spaCy
  • bert-large-cased as the language model
  • REL library for entity linking to the 2019 Wikipedia knowledge graph
  • Sample sentence: One thing many people don’t know about actor Robert De Niro – known for his many infamous tough guy roles in films such as "Goodfellas" and "Taxi Driver" – is that he is a 15-year cancer survivor. De Niro was diagnosed with prostate cancer when he was 60 years old in 2003. But thanks to regular cancer screening which led to his doctors catching the disease early, the actor was treated swiftly – and was able to continue his prolific acting career.

As you can see, it found only one triplet and was able to link it to Wikipedia entities:

head: "Robert_De_Niro", relation: "diagnose", tail: "Prostate_cancer", confidence: 0.16

Note: you can just prefix the entity with https://en.wikipedia.org/wiki/ to explore the Wikipedia page for that entity. For example: https://en.wikipedia.org/wiki/Prostate_cancer

2. Leverage Named Entities in addition to Noun Chunks to identify anchor text (Step 3 in Figure 3)

The authors use noun chunks from spaCy to identify the anchor terms, which are then used to find triplets. We saw that the out-of-the-box implementation generated only one triplet. Most of the prior research uses named entities in this step (see step 1 in Figure 1, which describes the overall approach), so let’s also use a Named Entity Recognizer (NER) to aid the process. We will leverage spaCy to provide named entities in addition to the noun chunks, removing any overlapping portions (a minimal sketch of this merge appears after the note below). Let’s look at what gets generated for the same sentence:

As we see, there is a considerable impact of using Named Entities, as it was able to generate more triplets.

Note: You can also see that the entity linker (REL) does the best it can to identify entities for the various mentions, and the fact that we have more mentions than needed can introduce noise. For instance, we have "many people" as a mention, which was disambiguated to "Unforgettable_(Nat_King_Cole_song)", and "60 years old" was disambiguated to "Lettrism". While we have not made an explicit effort to filter out specific entity types from spaCy here, we can do that based on use case requirements, which should remove many of the unwanted mentions, and hence relations. These are typical cost-benefit trade-offs that a use case has to make.
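For reference, merging spaCy’s noun chunks with its named entities while dropping overlapping spans can be done with spaCy’s filter_spans utility; here is a minimal sketch (the merge in the forked repo may differ in detail):

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_md")
doc = nlp(
    "De Niro was diagnosed with prostate cancer when he was 60 years old in 2003."
)

# Combine both sources of anchor candidates; filter_spans keeps the
# longest span whenever noun chunks and named entities overlap.
candidates = list(doc.noun_chunks) + list(doc.ents)
anchors = filter_spans(candidates)
print([anchor.text for anchor in anchors])
```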

Let’s now try using this on a domain-specific example from healthcare:

Conjunctivitis is an inflammation of the thin, clear membrane (conjunctiva) that covers the white of the eye and the inside surface of the eyelids. Conjunctivitis, commonly known as "pink eye," is most often caused by a virus but also can be caused by bacterial infection, allergies (e.g., cosmetics, pollen) and chemical irritation.

The default model en_core_web_md does not produce much this time, even after adding Named Entities to the process –

Let’s switch to a domain-specific pre-trained NER that was trained on a healthcare dataset by AllenAI and released here: scispaCy

As we see above, the domain-specific NER generates more triplets. There are a few incorrect triplets that get extracted too, and a couple of reasons why this may happen:

  1. The threshold used to filter out triplets is low (the default of 0.005), which brings back a lot more relations. We will look at the effect of the threshold further down in the article.
  2. The confidence in the triplets (computed from the attentions) also depends on the LM used. This model was primarily trained on general corpora like Wikipedia (not healthcare specifically), and hence may not have seen these terms often enough.

Switching to a different scispaCy Model to identify Named Entities gives us significantly different results:

Observation: It is clear that the same language model produced many more triplets when we leveraged a relevant named entity model to pick the anchor terms, rather than relying on just the noun chunks or the default NER from spaCy. This shows that it is not easy to just plug and play: it is necessary for a data scientist/NLP expert to pick the right method for identifying anchor text upfront, which may need some experimenting or training.
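Swapping in a scispaCy model is a small change in code; here is a minimal sketch, assuming the scispaCy model packages (e.g., en_core_sci_lg) have been installed from the release URLs in the scispaCy repo:

```python
import spacy

# Assumes: pip install scispacy, plus the model package installed from the
# scispaCy releases page (e.g., the en_core_sci_lg tarball).
nlp_sci = spacy.load("en_core_sci_lg")

text = (
    'Conjunctivitis, commonly known as "pink eye," is most often caused by a virus '
    "but also can be caused by bacterial infection, allergies and chemical irritation."
)
doc = nlp_sci(text)

# Domain-specific entities become the anchor terms fed into the match step
print([(ent.text, ent.label_) for ent in doc.ents])
```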

3. Pick a relevant Entity Linker (Step 8 in Figure 3)

Let’s run the domain-specific example about Conjunctivitis through the default entity linker (REL). REL has an option to pick the Wikipedia corpus year (I picked wiki_2019), which then maps the mentions to entities in the 2019 version of Wikipedia. Let’s look at the output:

We see in the figure above that it does a decent job at resolving entities. This is because we picked an example with well-known concepts, which have some reference in the Wikipedia KG. Let’s pick a more complex sentence from the same domain:

Phenylketonuria (PKU) and mild hyperphenylalaninemia (MHP) are allelic disorders caused by mutations in the gene encoding phenylalanine hydroxylase (PAH).

The above graph was generated by passing the sentence through the bert-large-cased model, the en_core_sci_lg model from scispaCy for named entities, and the REL entity linker. We can see that the entity linker does well on some entities but fails on others. Entities like "Mutant_(Marvel_Comics)" or "National_Movement_Party" are clearly incorrect.
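As an aside, you can query REL directly to inspect what it returns for each mention. The sketch below uses the public REL API endpoint described in the REL documentation; the endpoint availability and response format are assumptions to verify before relying on them.

```python
import requests

# Public REL endpoint as described in the REL repo; verify before relying on it.
API_URL = "https://rel.cs.ru.nl/api"

text = (
    "Phenylketonuria (PKU) and mild hyperphenylalaninemia (MHP) are allelic "
    "disorders caused by mutations in the gene encoding phenylalanine hydroxylase (PAH)."
)

# An empty "spans" list lets REL run its own mention detection before disambiguation.
response = requests.post(API_URL, json={"text": text, "spans": []})
for annotation in response.json():
    print(annotation)  # each item contains the mention, the linked entity, and scores
```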

The graph below was generated using the same configuration, except with the scispaCy entity linker, which was built to identify concepts in the healthcare domain. The scispaCy entity linker also allows us to map the concepts to different knowledge graphs in the medical domain; I picked the UMLS KG as the target. We can see that this entity linker did quite well in disambiguating the nodes. Here, again, tweaking the threshold can remove a number of incorrect triplets, which is discussed in the next section.
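For reference, here is a minimal sketch of the scispaCy entity linker pointed at UMLS, following scispaCy’s documented usage (note that the first run downloads a large knowledge base):

```python
import spacy
from scispacy.abbreviation import AbbreviationDetector  # registers "abbreviation_detector"
from scispacy.linking import EntityLinker  # registers "scispacy_linker"

nlp = spacy.load("en_core_sci_lg")
nlp.add_pipe("abbreviation_detector")
nlp.add_pipe(
    "scispacy_linker",
    config={"resolve_abbreviations": True, "linker_name": "umls"},
)

doc = nlp(
    "Phenylketonuria (PKU) and mild hyperphenylalaninemia (MHP) are allelic disorders "
    "caused by mutations in the gene encoding phenylalanine hydroxylase (PAH)."
)

linker = nlp.get_pipe("scispacy_linker")
for ent in doc.ents:
    for cui, score in ent._.kb_ents[:1]:  # top UMLS candidate per mention
        print(ent.text, cui, round(score, 3), linker.kb.cui_to_entity[cui].canonical_name)
```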

Observation: Having access to a relevant entity linker can save a lot of time and effort by identifying the right entities in the target KG. It plays a key role in extracting as much as possible automatically, leaving less work for humans who would otherwise need to correct entities and relations.

4. Pick the right threshold to filter triplets (Step 7 in Figure 3) and a relevant Language Model (Step 2 in Figure 3)

Finally, let’s investigate the impact of switching language models and of the threshold used to filter out triplets. These two aspects are related because the choice of language model influences the confidence scores, and therefore the threshold that will work for the filtering process.

The paper provides some insights into the difference between using a BERT-like model and a GPT-style model (auto-regressive in nature). The authors indicate that BERT-like models have better success at identifying relations because they are bidirectional in nature and see tokens in both directions while learning to predict masked tokens. This was consistent with my experiments, and hence I decided to focus on BERT-based models for further analysis.

Let’s start with the same example that discusses Conjunctivitis:

Conjunctivitis is an inflammation of the thin, clear membrane (conjunctiva) that covers the white of the eye and the inside surface of the eyelids. Conjunctivitis, commonly known as "pink eye," is most often caused by a virus but also can be caused by bacterial infection, allergies (e.g., cosmetics, pollen) and chemical irritation.

We will experiment with two transformer based models:

Healthcare domain-specific language model from Hugging Face (Link): bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16

Default model: bert-large-cased

We will also try three values for threshold: 0.005, 0.05 and 0.1
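A minimal sketch of this setup: load each model with attention outputs enabled (as in the earlier snippet) and filter candidate facts by the chosen threshold. The candidate triplets and scores below are made up purely for illustration.

```python
from transformers import AutoModel, AutoTokenizer

MODEL_NAMES = [
    "bert-large-cased",
    "bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16",
]

for name in MODEL_NAMES:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_attentions=True)
    # ... run the match step with this model to score candidate (head, relation, tail) facts ...

# Illustrative, made-up candidate facts with confidence scores
candidates = [
    ("Conjunctivitis", "caused by", "virus", 0.12),
    ("Conjunctivitis", "known as", "pink eye", 0.04),
    ("allergies", "caused by", "cosmetics", 0.003),
]

for threshold in (0.005, 0.05, 0.1):
    kept = [fact for fact in candidates if fact[3] >= threshold]
    print(threshold, kept)
```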

Each graph below is the output from one of the configurations, as detailed in the description below it. You can also look at the top portion of each figure to see the other parameters used. The confidence value for each triplet is marked on the corresponding edge with the pattern:

Observation: The graphs are almost identical for the two models when the threshold is set to 0.005 (the value suggested by the authors). The difference starts showing when we move to a higher threshold of 0.05. The domain-specific model (bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16) displayed more confidence overall in its triplets when the threshold was moved from 0.005 to 0.05. Initially, I had assumed that the threshold would just remove some existing triplets and keep the rest as is, but that was a wrong assumption. Most of the triplets had updated scores, higher than before, when the threshold was changed from 0.005 to 0.05. This suggests that the beam search, which assigns scores to each triplet, summed up more values for each triplet due to the smaller number of candidates.

Overall, there is a tradeoff between threshold and number of triplets, and it is up to the NLP expert to pick the right threshold, which does change the extracted triplets significantly.

Conclusion

It is encouraging to see that language models can be used to extract candidate facts from text, but as we observed in the experiments above, it is equally important for the data scientist/NLP expert to get the rest of the pipeline tuned to the use case.

Getting the right "knowledge" out depends heavily on many more factors beyond plugging in a pre-trained LM, such as:

  1. Identifying the right anchor terms (noun chunks, named entities, etc.)
  2. Configuring the relevant relation types (this aspect was not discussed in the article, but the authors also set constraints on what relations are valid based on predefined dictionaries)
  3. Having access to relevant pre-trained entity linkers
  4. Tuning the right thresholds for triplets

References

  • Original Paper: Link
  • Original Code: Link
  • Transformer Models from Hugging Face: Link
  • REL Entity Linker Original Paper: Link
  • REL Entity Linker GitHub: Link
  • scispaCy: Link
  • scispaCy for Bio-medical Named Entity Recognition (NER): Link
  • Dash from Plotly: Link
  • Dash + Plotly for network visualization: Link
  • Data from:
      • Diabetes: Mechanism, Pathophysiology and Management-A Review
      • Conjunctivitis
      • UMLS Knowledge Sources: File Downloads
      • https://www.kaggle.com/tboyle10/medicaltranscriptions
