Keyphrase Extraction with BERT Transformers and Noun Phrases

Using noun phrase preprocessing to enhance BERT-based keyword extraction

Tim Schopf
Towards Data Science


Image by Amador Loureiro on Unsplash

This post is based on our paper “PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction” (2022). You can read more details about our approach there or in our PatternRank blog post.

To get a quick overview of text content, it can be helpful to extract keywords that concisely reflect its semantic context. Although the commonly used term is keywords, we usually actually want keyphrases for this purpose.

Keywords or keyphrases should both describe the essence of what the text is about. The difference between the two is that keywords are single words, while keyphrases are made up of a few words. E.g. “puppy” vs. “puppy obedience training”. -Iris Guelen

Keyphrases provide a more accurate description than simple keywords and are therefore often the preferred choice. Thankfully, many open-source solutions exist that allow us to automatically extract keyphrases from text. One recently very popular solution is KeyBERT, an easy-to-use Python package for keyphrase extraction with BERT language models. Briefly explained, KeyBERT works by first creating BERT embeddings of document texts. Afterwards, BERT embeddings of candidate keyphrases (word n-grams of predefined lengths) are created. Finally, cosine similarities between document and keyphrase embeddings are calculated to extract the keyphrases that best describe the entire document. A more detailed introduction to KeyBERT is available here.
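To illustrate the basic workflow, here is a minimal usage sketch, assuming KeyBERT's standard API and its default sentence-transformers model (the example document is made up):

from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs."

# Load KeyBERT with its default sentence-transformers model
kw_model = KeyBERT()

# Embed the document and its candidate n-grams, then rank the candidates
# by cosine similarity to the document embedding
keyphrases = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words='english', top_n=5)
print(keyphrases)  # a list of (keyphrase, similarity score) tuples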

Why KeyBERT results need to be enhanced

Although KeyBERT is capable of extracting good keyphrases on its own, two issues still arise in practice. Both are caused by the way KeyBERT extracts candidate keyphrases from documents prior to the embedding step: users need to predefine a word n-gram range that specifies the length of the extracted keyphrases. KeyBERT then extracts simple word n-grams of the defined lengths from documents and uses them as candidate keyphrases for embedding creation and similarity calculation.

A word n-gram range lets users decide the length of the sequence of consecutive words that should be extracted from a given text. Let’s suppose we define a word n-gram range = (1,3). Then we would choose to extract the unigrams (only single word), bigrams (group of two consecutive words), and trigrams (group of three consecutive words) from the text. Applying the word n-gram range to "an apple a day keeps the doctor away" will result in ["an", "apple", "a", "day", "keeps", "the", "doctor", "away", "an apple", "apple a", "a day", "day keeps", "keeps the", "the doctor", "doctor away", "an apple a", "apple a day", "a day keeps", "day keeps the", "keeps the doctor", "the doctor away"]. -Devish Parmar
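To see this in action, here is a small helper written purely for illustration (it is not part of KeyBERT or any other package):

# Illustrative helper: extract all word n-grams within a given range
def word_ngrams(text, ngram_range=(1, 3)):
    words = text.split()
    low, high = ngram_range
    return [" ".join(words[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(words) - n + 1)]

# Prints the unigrams, bigrams, and trigrams from the quote above
print(word_ngrams("an apple a day keeps the doctor away"))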

However, users usually do not know the optimal n-gram range in advance and therefore have to spend some time experimenting until they find one that is suitable. Furthermore, this approach does not consider grammatical sentence structure at all. As a result, even after finding a good n-gram range, the returned keyphrases are sometimes grammatically incorrect or slightly off-key. Continuing the above example, it would not be very desirable if KeyBERT identified the keyphrases “day keeps” or “keeps the doctor” as the most important ones from the set of candidate keyphrases.

How to enhance KeyBERT results with KeyphraseVectorizers

To address the issues mentioned above, we can use the KeyphraseVectorizers package together with KeyBERT. The KeyphraseVectorizers package extracts keyphrases with part-of-speech patterns from a collection of text documents and converts them into a document-keyphrase matrix. A document-keyphrase matrix is a mathematical matrix that describes the frequency of keyphrases that occur in a collection of documents.

How does the KeyphraseVectorizers package work?

First, the document texts are annotated with spaCy part-of-speech tags. Second, keyphrases whose part-of-speech tags match a predefined regex pattern are extracted from the document texts. By default, the vectorizers extract keyphrases with zero or more adjectives, followed by one or more nouns, using the English spaCy part-of-speech tags. Finally, the vectorizers calculate document-keyphrase matrices. Apart from the matrices, the package can also provide the keyphrases extracted via part-of-speech.
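To illustrate the first step, this is roughly how the spaCy part-of-speech annotation looks for a short sentence (a sketch using the standard spaCy API; the tags in the comment are what the small English model typically produces):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("an apple a day keeps the doctor away")

# Fine-grained part-of-speech tags that the regex pattern is matched against
print([(token.text, token.tag_) for token in doc])
# e.g. [('an', 'DT'), ('apple', 'NN'), ('a', 'DT'), ('day', 'NN'),
#       ('keeps', 'VBZ'), ('the', 'DT'), ('doctor', 'NN'), ('away', 'RB')]

With the default pattern (zero or more adjectives followed by one or more nouns), only the nouns “apple”, “day”, and “doctor” would qualify as keyphrases here, since the sentence contains no adjectives.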

Example:

We can install the KeyphraseVectorizers package with the following command: pip install keyphrase-vectorizers.
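After installation, we can initialize the vectorizer and inspect its defaults. A sketch, assuming the package's KeyphraseCountVectorizer class and its scikit-learn-style get_params() method:

from keyphrase_vectorizers import KeyphraseCountVectorizer

# Initialize the vectorizer with its default parameters
vectorizer = KeyphraseCountVectorizer()

# Print the default parameter configuration
print(vectorizer.get_params())

This prints the following configuration: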

{'binary': False, 'dtype': <class 'numpy.int64'>, 'lowercase': True, 'max_df': None, 'min_df': None, 'pos_pattern': '<J.*>*<N.*>+', 'spacy_pipeline': 'en_core_web_sm', 'stop_words': None, 'workers': 1}

By default, the vectorizer is initialized for the English language. That means an English spacy_pipeline is specified, English stop_words are removed, and the pos_pattern extracts keyphrases consisting of zero or more adjectives followed by one or more nouns, using the English spaCy part-of-speech tags.
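To produce the document-keyphrase matrix and the list of extracted keyphrases shown below, we can fit the vectorizer on a list of documents. A sketch; the two demo documents (one about supervised learning, one about keyword extraction) are abbreviated here:

from keyphrase_vectorizers import KeyphraseCountVectorizer

docs = [
    "Supervised learning is the machine learning task of ...",  # full demo text omitted
    "Keywords are defined as phrases that capture the main topics of a document ...",  # full demo text omitted
]

vectorizer = KeyphraseCountVectorizer()

# Document-keyphrase matrix: one row per document, one column per keyphrase
print(vectorizer.fit_transform(docs).toarray())

# The keyphrases extracted via the part-of-speech pattern
print(vectorizer.get_feature_names_out())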


[[0 0 0 0 1 3 2 1 1 0 1 1 3 1 0 0 0 0 1 0 1 1 1 0 1 0 2 0 1 1 1 0 1 1 0 0 0 1 1 3 3 0 1 3 3]
[1 1 5 1 0 0 0 0 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 2 0 1 0 1 0 0 0 2 0 0 1 1 1 0 0 0 0 5 0 0 0]]
['users' 'main topics' 'learning algorithm' 'overlap' 'documents' 'output' 'keywords' 'precise summary' 'new examples' 'training data' 'input' 'document content' 'training examples' 'unseen instances' 'optimal scenario' 'document' 'task' 'supervised learning algorithm' 'example' 'interest' 'function' 'example input' 'various applications' 'unseen situations' 'phrases' 'indication' 'inductive bias' 'supervisory signal' 'document relevance' 'information retrieval' 'set' 'input object' 'groups' 'output value' 'list' 'learning' 'output pairs' 'pair' 'class labels' 'supervised learning' 'machine' 'information retrieval environment' 'algorithm' 'vector' 'way']

The output of the vectorizer shows that the extracted keyphrases, unlike simple n-grams, are grammatically correct and make sense. This results from the vectorizer extracting noun phrases and expanded noun phrases.

A noun phrase is a simple phrase built around a noun. It contains a determiner and a noun. For example: a tree, some sweets, the castle. An expanded noun phrase adds more detail to the noun by adding one or more adjectives. An adjective is a word that describes a noun. For example: a huge tree, some colourful sweets, the large, royal castle. -BBC

How to use KeyphraseVectorizers with KeyBERT?

The keyphrase vectorizers can be used together with KeyBERT to extract grammatically correct keyphrases that are most similar to a document. Here, the vectorizer first extracts candidate keyphrases from the text documents, and KeyBERT subsequently ranks them based on their similarity to the document. The top-n most similar keyphrases can then be considered document keywords.

The advantage of using KeyphraseVectorizers in addition to KeyBERT is that it allows users to get grammatically correct keyphrases instead of simple n-grams of pre-defined lengths.

Instead of simple n-grams, the KeyphraseVectorizers first extract candidate keyphrases that consist of zero or more adjectives followed by one or more nouns in a preprocessing step. TextRank, SingleRank, and EmbedRank have all successfully used this noun phrase approach for keyphrase extraction. The extracted candidate keyphrases are subsequently passed to KeyBERT for embedding generation and similarity calculation. To use both packages for keyphrase extraction, we need to pass KeyBERT a keyphrase vectorizer with the vectorizer parameter. Since the length of keyphrases now depends on part-of-speech tags, there is no need to define an n-gram length anymore.

Example:

KeyBERT can be installed via pip install keybert.

Instead of deciding on a suitable n-gram range, which could be e.g. (1,2)…
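That call might look like this (a sketch; docs holds the two demo documents from above, and the KeyBERT API shown is the standard one):

from keybert import KeyBERT

kw_model = KeyBERT()

# Rank simple word n-grams of length 1 to 2 as candidate keyphrases
keyphrases = kw_model.extract_keywords(docs=docs, keyphrase_ngram_range=(1, 2), stop_words='english')
print(keyphrases)

and would return n-gram keyphrases like these: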

[[('labeled training', 0.6013),
('examples supervised', 0.6112),
('signal supervised', 0.6152),
('supervised', 0.6676),
('supervised learning', 0.6779)],
[('keywords assigned', 0.6354),
('keywords used', 0.6373),
('list keywords', 0.6375),
('keywords quickly', 0.6376),
('keywords defined', 0.6997)]]

we can now simply let the keyphrase vectorizer decide on suitable keyphrases, without limiting ourselves to a maximum or minimum n-gram range. We only have to pass the keyphrase vectorizer to KeyBERT via its vectorizer parameter:
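A sketch of the combined call, assuming the standard APIs of both packages:

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

kw_model = KeyBERT()

# Let the part-of-speech-based vectorizer provide the candidate keyphrases
# instead of fixed-length n-grams
keyphrases = kw_model.extract_keywords(docs=docs, vectorizer=KeyphraseCountVectorizer())
print(keyphrases)

This returns grammatically coherent keyphrases of varying lengths: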

[[('learning', 0.4813), 
('training data', 0.5271),
('learning algorithm', 0.5632),
('supervised learning', 0.6779),
('supervised learning algorithm', 0.6992)],
[('document content', 0.3988),
('information retrieval environment', 0.5166),
('information retrieval', 0.5792),
('keywords', 0.6046),
('document relevance', 0.633)]]

This allows us to make sure that we do not cut off important words by defining our n-gram range too short. For example, we would not have found the keyphrase “supervised learning algorithm” with keyphrase_ngram_range=(1,2). Furthermore, we avoid getting keyphrases that are slightly off-key, like “labeled training”, “signal supervised” or “keywords quickly”.

Extraction of keyphrases in languages other than English

We can also apply this approach to other languages, such as German. This requires only some parameter modifications for KeyphraseVectorizers and KeyBERT.

For the KeyphraseVectorizers, the spacy_pipeline and stop_words parameters need to be modified to spacy_pipeline=’de_core_news_sm’ and stop_words=’german’. Because the German spaCy part-of-speech tags differ from the English ones, the pos_pattern parameter also needs modification. The regex pattern <ADJ.*>*<N.*>+ extracts keyphrases with zero or more adjectives, followed by one or more nouns, using the German spaCy part-of-speech tags.

For KeyBERT, the Flair package needs to be installed via pip install flair and a German BERT model has to be selected.
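Putting it together, a sketch of the German setup ('bert-base-german-cased' is one possible German BERT model from the Hugging Face hub, and german_docs stands for the two German demo documents whose results are shown below):

from flair.embeddings import TransformerDocumentEmbeddings
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Placeholder for the two German demo documents (full texts omitted)
german_docs = ["...", "..."]

# German part-of-speech pattern: zero or more adjectives, then one or more nouns
vectorizer = KeyphraseCountVectorizer(
    spacy_pipeline='de_core_news_sm',
    stop_words='german',
    pos_pattern='<ADJ.*>*<N.*>+',
)

# Load a German BERT model through Flair and pass it to KeyBERT
german_model = TransformerDocumentEmbeddings('bert-base-german-cased')
kw_model = KeyBERT(model=german_model)

keyphrases = kw_model.extract_keywords(docs=german_docs, vectorizer=vectorizer)
print(keyphrases)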

[[('schwester cornelia', 0.2491),
('neigung', 0.2996),
('angesehenen bürgerlichen familie', 0.3131),
('ausbildung', 0.3651),
('straßburg', 0.4022)],
[('tochter', 0.0821),
('friedrich schiller', 0.0912),
('ehefrau elisabetha dorothea schiller', 0.0919),
('neckar johann kaspar schiller', 0.092),
('wundarztes', 0.1334)]]

Summary

KeyphraseVectorizers is a recently released package that can be used in addition to KeyBERT to extract enhanced keyphrases from text documents. This approach eliminates the need for user-defined word n-gram ranges and extracts grammatically correct keyphrases. Furthermore, the approach can be applied to many different languages. Both open-source packages are easy to use and allow precise keyphrase extraction in just a few lines of code.

Big thanks also to Maarten Grootendorst, who gave me input and inspiration while writing the KeyphraseVectorizers package.

Sources

Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.

Wan, X. and Xiao, J. (2008). CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 969–976, Manchester, UK. Coling 2008 Organizing Committee.

Bennani-Smires, K., Musat, C., Hossmann, A., Baeriswyl, M., and Jaggi, M. (2018). Simple unsupervised keyphrase extraction using sentence embeddings. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 221–229, Brussels, Belgium. Association for Computational Linguistics.
