Implementing Hearst Patterns with SpaCy

Automatic extraction of hypernym (hyponym) relations

Nikita Kiselov
Towards Data Science


⚠️ In this article, I will mostly concentrate on Hearst patterns, their implementation, and their usage for hypernym extraction. However, I will also use Named Entity Recognition (NER) and a dataset of patents, so I recommend checking my previous post in this series.

Patterns … patterns are everywhere…

Why do we care about patterns in the context of NLP? Because they significantly reduce and simplify the work: a pattern is, basically, a simple model. Even in the era of Transformer neural networks, patterns can still be beneficial. Automatic hypernym extraction has been an active area of research for around 20 years. It is a crucial tool for downstream tasks such as question answering, queries, information extraction, etc.

The usefulness of the hyponym relationship

hypernym: a word with a broad meaning that constitutes a category into which words with more specific meanings fall.

hyponym: the reverse; a word with a more specific meaning than a general term applicable to it.

Let’s take an example to make this clear:

Here, “CD” and “hard drive” are hyponyms of “storage units”. Conversely, “storage units” is a hypernym of “CD” and “hard drive”.

Such a lexical relation is an essential building block for many NLP tasks. Depending on the goal, these tasks include:

  • Taxonomy prediction: identifying broader categories for the terms, building taxonomy relations (like WikiData GraphAPI)
  • Information extraction (IE): automated retrieval of specific information from text relies heavily on the relations between the entities being searched for.
  • Dataset creation: advanced models need examples from which to learn how to identify the relationships between entities.

Hearst patterns

So, how can we detect and extract such a relation? It’s time to talk about the work of computational linguistics researcher Marti Hearst. One of her most popular studies focuses on building a set of text patterns that can be employed to extract meaningful information from text. These patterns are popularly known as “Hearst Patterns”.

We can formalise one such pattern as “X which is a Y”, where X is the hyponym and Y is the hypernym (X is a kind of Y). This is just one of the many Hearst Patterns. Here’s a list to give you an intuition behind the idea:

Image by the Author | Table of patterns to detect hyper/rhyper relations

The patterns in this table are categorised as hyper or rhyper (reversed hypernym), depending on the order in which the two terms appear. Usually, the order is unimportant, but sometimes it is pretty helpful for training information extraction systems.
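To make the idea concrete, here is a toy sketch of two well-known Hearst patterns ("Y such as X" and "X and other Y") written as plain regular expressions. It is only an illustration: the one-or-two-word "noun phrase", the function name `find_relations`, and the convention that `hyper` means "hypernym comes first" while `rhyper` means "hyponym comes first" are all assumptions made for this example.

```python
import re

# Toy noun phrase: one or two words. A real system would use a parser or NER.
NP = r"\w+(?:\s\w+)?"

# (label, compiled regex, does the hypernym come first?)
# Assumption: "hyper" = hypernym first, "rhyper" = hyponym first.
PATTERNS = [
    ("hyper", re.compile(rf"\b(?P<first>{NP}),?\s+such as\s+(?P<second>{NP})"), True),
    ("rhyper", re.compile(rf"\b(?P<first>{NP}),?\s+and other\s+(?P<second>{NP})"), False),
]


def find_relations(sentence):
    """Return (label, hypernym, hyponym) triples found in the sentence."""
    found = []
    for label, regex, hypernym_first in PATTERNS:
        for m in regex.finditer(sentence):
            first, second = m.group("first"), m.group("second")
            # Put the terms in (hypernym, hyponym) order regardless of pattern direction.
            hypernym, hyponym = (first, second) if hypernym_first else (second, first)
            found.append((label, hypernym, hyponym))
    return found


print(find_relations("storage units such as hard drives"))
print(find_relations("CDs and other storage units"))
```

Both sentences yield the same pair in (hypernym, hyponym) order; only the label differs, which is why keeping the direction can be useful later on.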

You could argue that such an approach looks outdated and oversimplified nowadays, and that we could use ML and complex models instead.

BUT, that is not totally true!

In this paper, the FB (Meta) research team showed that

“…simple pattern-based methods consistently outperform distributional methods on common benchmark datasets.”

Sometimes, good old reliable tools are more than enough 🛠

Implementation with SpaCy

Let’s move from theory to practice. Usually, you don’t want to extract all possible hyponym relations, but only those between entities in a specific domain. Recognising such entities in text is the job of Named Entity Recognition (NER), and by far the simplest way to do it is with SpaCy. With this library, you can train a custom NER model to recognise entities from a more specific domain than the default model covers.

Image by the Author | Result of custom NER model from my previous post

Data

As an example, I will use the texts of patents from the G06K (Recognition of data/Presentation of data) subclass. On top of this data, I trained a custom NER model to recognise technical terms. I described this dataset in detail in my previous post.

⚠️ The data is copyright free and safe to use for commercial purposes. According to the USPTO: “Subject to limited exceptions reflected in 37 CFR 1.71(d) & (e) and 1.84(s), the text and drawings of a patent are typically not subject to copyright restrictions.”

Implementation

Creating patterns in SpaCy is pretty straightforward. Since we are using the NER model, we can rely on its predictions to filter out entities that fall outside our domain of interest.

Patterns can be created in JSON format. Here are a few of them, based on SpaCy’s rule-based matching documentation.

Image by the Author | Example of the patterns in JSON format

You can see that by specifying ENT_TYPE we are utilising the NER model to match only words in this domain.

The implementation in Python is just as simple: we read the text, initialise the matcher, read the patterns from JSON, and add them to the matcher.

Code snippet of loading patterns into SpaCy matcher
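A minimal sketch of this step could look as follows. It assumes SpaCy v3, a hypothetical entity label `TECH` for the custom NER model, and a made-up JSON layout that maps each pattern name to a list of token patterns; the real patterns file may be organised differently. A blank English pipeline stands in for the trained model so that the example runs without downloading anything.

```python
import json

import spacy
from spacy.matcher import Matcher

# Hypothetical content of a patterns file. "TECH" is an assumed entity label,
# and "hyper"/"rhyper" are taken to mean hypernym-first / hyponym-first.
PATTERNS_JSON = """
{
  "hyper": [
    [{"ENT_TYPE": "TECH", "OP": "+"},
     {"IS_PUNCT": true, "OP": "?"},
     {"LOWER": "such"}, {"LOWER": "as"},
     {"ENT_TYPE": "TECH", "OP": "+"}]
  ],
  "rhyper": [
    [{"ENT_TYPE": "TECH", "OP": "+"},
     {"IS_PUNCT": true, "OP": "?"},
     {"LOWER": "and"}, {"LOWER": "other"},
     {"ENT_TYPE": "TECH", "OP": "+"}]
  ]
}
"""


def build_matcher(nlp, patterns_json):
    """Create a Matcher and register every named pattern group from the JSON."""
    matcher = Matcher(nlp.vocab)
    for name, token_patterns in json.loads(patterns_json).items():
        # Each name maps to a list of patterns; each pattern is a list of token dicts.
        matcher.add(name, token_patterns)
    return matcher


nlp = spacy.blank("en")  # stand-in for the custom NER pipeline
matcher = build_matcher(nlp, PATTERNS_JSON)
print(len(matcher), "pattern groups loaded")
```

In a real project, `PATTERNS_JSON` would be read from a file and `nlp` would be the trained pipeline loaded with `spacy.load`.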

By simply calling matcher(doc), we extract the list of hypernym relations. Along with the matched spans, we get some information about each match, such as the name of the pattern (hyper/rhyper in our case) and whether it is a multiword relation.

Code snippet of utilising matcher on the text span
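As a rough illustration of how the matches can be turned into pairs, the sketch below runs a single "such as" pattern on one sentence. It assumes SpaCy v3 and a hypothetical `TECH` label; the entities are set by hand in place of a trained NER model, and the helper `extract_pairs` is an illustrative name, not part of SpaCy.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # stand-in for the custom NER pipeline
matcher = Matcher(nlp.vocab)
matcher.add(
    "hyper",
    [[
        {"ENT_TYPE": "TECH", "OP": "+"},
        {"IS_PUNCT": True, "OP": "?"},
        {"LOWER": "such"}, {"LOWER": "as"},
        {"ENT_TYPE": "TECH", "OP": "+"},
    ]],
    greedy="LONGEST",  # keep only the longest non-overlapping match
)

doc = nlp("The device supports storage units such as hard drives.")
# Mark entities manually; a trained NER component would normally do this.
doc.ents = [Span(doc, 3, 5, label="TECH"), Span(doc, 7, 9, label="TECH")]


def extract_pairs(doc, matcher):
    """Turn matcher hits into (pattern name, hypernym, hyponym) triples."""
    pairs = []
    for match_id, start, end in matcher(doc):
        name = doc.vocab.strings[match_id]  # hash back to the pattern name
        ents = [e for e in doc.ents if e.start >= start and e.end <= end]
        if len(ents) < 2:
            continue
        first, last = ents[0].text, ents[-1].text
        # Assumption: "hyper" patterns list the hypernym first, "rhyper" the hyponym.
        hypernym, hyponym = (first, last) if name == "hyper" else (last, first)
        pairs.append((name, hypernym, hyponym))
    return pairs


print(extract_pairs(doc, matcher))
```

Each match is a `(match_id, start, end)` tuple of token offsets, so the pattern name has to be looked up in the vocabulary's string store.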

Multiword patterns

The most common problem we faced with the matcher and patterns was hypernym relations that involve multiple entities.

Image by the Author | Examples of possible hyper. relation with multiple entities

Since a single pattern does not easily capture a variable number of entities, here we propose a trick that can be useful ;)

After finding the matched pattern, we move further along the sentence and check the other entities in it. If they belong to our domain and are separated only by connection words, those entities are also part of the hyper/rhyper relation.

Image by the Author | Visual illustration of the multiword relation matching

The main trick in the code is to keep a list of ‘continue words’ (the connection words mentioned above) and to scan the rest of the sentence for further entity matches.

Code snippet of patterns extraction with multiple entities
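The idea can be sketched in plain Python, independent of SpaCy. Everything here is an assumption made for illustration: the contents of `CONTINUE_WORDS`, the `TECH` label, the function name `extend_hyponyms`, and the representation of a sentence as a list of `(text, entity_label)` pairs.

```python
# Assumed list of connection words that may separate further entities.
CONTINUE_WORDS = {",", "and", "or"}


def extend_hyponyms(tokens, end, ent_label="TECH"):
    """Collect extra entities that follow a matched span.

    tokens: list of (text, entity_label) pairs for one sentence.
    end: index of the first token after the span the matcher returned.
    """
    extra = []
    i = end
    while i < len(tokens):
        # 1. Skip one or more continue words.
        j = i
        while j < len(tokens) and tokens[j][0].lower() in CONTINUE_WORDS:
            j += 1
        if j == i:
            break  # no connector, so the enumeration has ended
        # 2. Collect the run of in-domain entity tokens that follows.
        k = j
        while k < len(tokens) and tokens[k][1] == ent_label:
            k += 1
        if k == j:
            break  # connector not followed by an in-domain entity
        extra.append(" ".join(text for text, _ in tokens[j:k]))
        i = k
    return extra


# "storage units such as hard drives, CDs and flash memory."
sentence = [
    ("storage", "TECH"), ("units", "TECH"), ("such", ""), ("as", ""),
    ("hard", "TECH"), ("drives", "TECH"), (",", ""), ("CDs", "TECH"),
    ("and", ""), ("flash", "TECH"), ("memory", "TECH"), (".", ""),
]
# The matcher would stop after "hard drives", i.e. at token index 6.
print(extend_hyponyms(sentence, end=6))
```

With a SpaCy `Doc`, the same loop can run over `(token.text, token.ent_type_)` pairs starting at the `end` offset returned by the matcher.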

Results and notes

Voilà ✨! We extracted hypernym relations in the custom domain.

Image by the Author | Final result table of the extracted hyper. relations

You can find the full code, with a detailed notebook and the dataset, here:

Even though we already have results, it would be nice to validate them. In the next and last post of this “patents” series, I will show how to automatically validate the extracted hypernym relations on any custom dataset using the Wiki API. Stay tuned and follow 😉

Acknowledgement

Special thanks to my team on this project: Marwan MASHRA and Gaëtan SERRÉ.


Applied Scientist 👨‍🔬 | MSc AI @ Université Paris-Saclay 🇫🇷 | Ukrainian 🇺🇦