A comparison of different NER models and their ability to detect different named entities.
This blog post highlights an issue with spaCy and other Named Entity Recognition models: they often fail to accurately detect person names, especially biblical ones. The difference in detection rates between regular names and biblical names is striking.
You’ll see that for the simplest example we can think of, "My name is X", biblical names are almost never detected as names of persons.
I tried to get to the bottom of this and believe I have an answer. But first, let’s do a short experiment with two spaCy models (using spaCy version 3.0.5).
Compare detection rates of biblical vs. other names
Why is there a difference in the first place? The gap in detection rates could stem from:
- The fact that biblical names are sometimes older and less common (and therefore might appear less frequently in the dataset the model was trained on).
- The surrounding sentence being less likely to co-occur with the specific name in the original dataset.
- Issues with the dataset itself (such as incorrect annotations by human labelers).
To test these hypotheses (simplistically), we compare biblical names against a set of other names, both old and new, across three templates, two of which are taken from the Bible:
- "My name is X"
- "And X said, Why hast thou troubled us?"
- "And she conceived again, a bare a son; and she called his name X."
Let’s start by creating name lists and templates:
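Below is a minimal sketch of these lists; the name sets are reconstructed from the detailed results further down, and the exact code lives in the notebook linked at the end of the post.

```python
biblical_names = ["Abraham", "David", "Isaac", "Jacob", "Jesus", "John",
                  "Judas", "Mary", "Matthew", "Moses", "Samuel", "Simon"]

other_names = ["Ariana", "Barack", "Beyonce", "Bill", "Charles", "Coby",
               "Donald", "Frank", "George", "Helen", "Joe", "Katy",
               "Lebron", "Margaret", "Robert", "Ronald", "William"]

name_sets = {"Biblical": biblical_names, "Other": other_names}

templates = [
    "My name is {}",
    "And {} said, Why hast thou troubled us?",
    "And she conceived again, a bare a son; and she called his name {}.",
]
```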
Method for running the spaCy model and checking if "PERSON" was detected:
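One possible implementation (a sketch, assuming the loaded spaCy pipeline is passed in as `nlp`):

```python
def contains_person(nlp, text: str) -> bool:
    """Return True if the model detects at least one PERSON entity in the text."""
    doc = nlp(text)
    return any(ent.label_ == "PERSON" for ent in doc.ents)
```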
Model 1: spaCy’s en_core_web_lg model
- This model uses the original (non-transformers based) spaCy architecture.
- It was trained on the OntoNotes 5.0 dataset and achieves an F-measure of 0.86 on named entities.
Loading the model:
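(Assuming the model was downloaded first, e.g. with `python -m spacy download en_core_web_lg`.)

```python
import spacy

nlp = spacy.load("en_core_web_lg")
```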
Let’s run it:
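A sketch of the evaluation loop, reusing the hypothetical `name_sets`, `templates` and `contains_person` defined above and printing output in the format shown below:

```python
def evaluate(nlp, model_name, name_sets, templates):
    """Compute recall per (name set, template) pair and keep the per-name results."""
    detailed = {}
    print(f"Model name: {model_name}")
    for template in templates:
        for set_name, names in name_sets.items():
            results = {name: contains_person(nlp, template.format(name))
                       for name in names}
            detailed[(template, set_name)] = results
            recall = sum(results.values()) / len(results)
            print(f'Name set: {set_name}, Template: "{template}"')
            print(f"Recall: {recall:.2f}")
    print("Detailed results:")
    return detailed


results_lg = evaluate(nlp, "en_core_web_lg", name_sets, templates)
```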
Here are the results we get:
Model name: en_core_web_lg
Name set: Biblical, Template: "My name is {}"
Recall: 0.25
Name set: Other, Template: "My name is {}"
Recall: 0.94
Name set: Biblical, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.67
Name set: Other, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.94
Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.58
Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.94
Detailed results:
{('And she conceived again, a bare a son; and she called his name {}.', 'Biblical'): {
'Abraham': True,
'David': True,
'Isaac': False,
'Jacob': False,
'Jesus': False,
'John': True,
'Judas': False,
'Mary': True,
'Matthew': True,
'Moses': False,
'Samuel': True,
'Simon': True},
('And she conceived again, a bare a son; and she called his name {}.', 'Other'): {
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': True,
'Charles': True,
'Coby': False,
'Donald': True,
'Frank': True,
'George': True,
'Helen': True,
'Joe': True,
'Katy': True,
'Lebron': True,
'Margaret': True,
'Robert': True,
'Ronald': True,
'William': True},
('And {} said, Why hast thou troubled us?', 'Biblical'): {
'Abraham': True,
'David': True,
'Isaac': True,
'Jacob': False,
'Jesus': False,
'John': True,
'Judas': False,
'Mary': True,
'Matthew': True,
'Moses': False,
'Samuel': True,
'Simon': True},
('And {} said, Why hast thou troubled us?', 'Other'): {
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': True,
'Charles': True,
'Coby': False,
'Donald': True,
'Frank': True,
'George': True,
'Helen': True,
'Joe': True,
'Katy': True,
'Lebron': True,
'Margaret': True,
'Robert': True,
'Ronald': True,
'William': True},
('My name is {}', 'Biblical'): {
'Abraham': True,
'David': False,
'Isaac': False,
'Jacob': False,
'Jesus': False,
'John': False,
'Judas': False,
'Mary': True,
'Matthew': True,
'Moses': False,
'Samuel': False,
'Simon': False},
('My name is {}', 'Other'): {
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': True,
'Charles': True,
'Coby': False,
'Donald': True,
'Frank': True,
'George': True,
'Helen': True,
'Joe': True,
'Katy': True,
'Lebron': True,
'Margaret': True,
'Robert': True,
'Ronald': True,
'William': True}}
So there’s a pretty big difference in detection rates between biblical names and other names.
Model 2: spaCy’s en_core_web_trf model
spaCy recently released a new model, en_core_web_trf, based on the Hugging Face transformers library and also trained on OntoNotes 5.0.
Let’s try this model:
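Loading the transformer pipeline and reusing the same evaluation helper (this assumes the `spacy-transformers` extra is installed, e.g. `pip install spacy[transformers]`, and that `en_core_web_trf` has been downloaded):

```python
nlp_trf = spacy.load("en_core_web_trf")

results_trf = evaluate(nlp_trf, "en_core_web_trf", name_sets, templates)
```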
This time we get:
Model name: en_core_web_trf
Name set: Biblical, Template: "My name is {}"
Recall: 0.50
Name set: Other, Template: "My name is {}"
Recall: 1.00
Name set: Biblical, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.00
Name set: Other, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.11
Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.00
Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.50
Detailed results:
{('And she conceived again, a bare a son; and she called his name {}.', 'Biblical'): {
'Abraham': False,
'David': False,
'Isaac': False,
'Jacob': False,
'Jesus': False,
'John': False,
'Judas': False,
'Mary': False,
'Matthew': False,
'Moses': False,
'Samuel': False,
'Simon': False},
('And she conceived again, a bare a son; and she called his name {}.', 'Other'): {
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': False,
'Charles': False,
'Coby': False,
'Donald': True,
'Frank': True,
'George': False,
'Helen': False,
'Joe': True,
'Katy': True,
'Lebron': False,
'Margaret': False,
'Robert': False,
'Ronald': True,
'William': False},
('And {} said, Why hast thou troubled us?', 'Biblical'): {
'Abraham': False,
'David': False,
'Isaac': False,
'Jacob': False,
'Jesus': False,
'John': False,
'Judas': False,
'Mary': False,
'Matthew': False,
'Moses': False,
'Samuel': False,
'Simon': False},
('And {} said, Why hast thou troubled us?', 'Other'): {
'Ariana': False,
'Barack': True,
'Beyonce': True,
'Bill': False,
'Charles': False,
'Coby': False,
'Donald': False,
'Frank': False,
'George': False,
'Helen': False,
'Joe': False,
'Katy': False,
'Lebron': False,
'Margaret': False,
'Michael': False,
'Robert': False,
'Ronald': False,
'William': False},
('My name is {}', 'Biblical'): {
'Abraham': False,
'David': True,
'Isaac': True,
'Jacob': False,
'Jesus': False,
'John': True,
'Judas': False,
'Mary': True,
'Matthew': True,
'Moses': False,
'Samuel': True,
'Simon': False},
('My name is {}', 'Other'): {
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': True,
'Charles': True,
'Coby': True,
'Donald': True,
'Frank': True,
'George': True,
'Helen': True,
'Joe': True,
'Katy': True,
'Lebron': True,
'Margaret': True,
'Robert': True,
'Ronald': True,
'William': True}}
Although the numbers are different, we still see a gap between the two sets. This time, however, the model also seems to struggle with older names like Helen, William or Charles.
So what’s going on here?
As part of our work on Presidio (a tool for data de-identification), we develop models to detect PII entities. For that purpose, we extract template sentences from existing NER datasets, including CONLL-03 and OntoNotes 5.0. The idea is to augment these datasets with additional entity values, for better coverage of names, cultures and ethnicities. In other words, every time we see a sentence with a tagged person name in a dataset, we extract a template sentence (e.g. "The name is [LAST_NAME], [FIRST_NAME] [LAST_NAME]") and later replace the placeholders to produce multiple samples, each containing different first and last names.
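To make the idea concrete, here is an illustrative sketch of this kind of template filling; the name lists and the `fill_template` helper are made up for the example and are not Presidio's actual code:

```python
import random

first_names = ["Fatima", "Wei", "Olga", "Kwame"]
last_names = ["Hernandez", "Nguyen", "Cohen", "Okafor"]

template = "The name is [LAST_NAME], [FIRST_NAME] [LAST_NAME]"

def fill_template(template: str) -> str:
    """Replace placeholders with randomly drawn names (same last name for both slots)."""
    first, last = random.choice(first_names), random.choice(last_names)
    return template.replace("[FIRST_NAME]", first).replace("[LAST_NAME]", last)

print(fill_template(template))  # e.g. "The name is Nguyen, Wei Nguyen"
```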
When we manually went over the templating results, we realized that many names in our new templates dataset had never turned into templates. A majority of these names came from the biblical sentences that OntoNotes 5.0 contains. In other words, many samples in OntoNotes 5.0 did not contain any PERSON labels even though they did contain person names, an entity type the OntoNotes dataset claims to support. It seems that these models actually learn the errors in the dataset, in this case learning to ignore names when they are biblical.
Obviously, these errors are found in both the train and test sets, so a model that learns that biblical names are not really names would still score well on a similarly flawed test set. This is yet another example of why SOTA results are not necessarily the best way to show progress in science.
Is it only spaCy?
A similar evaluation on two Flair models shows that the model trained on OntoNotes achieves significantly lower results on this test, while the CONLL-based model actually does pretty well!
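For reference, a sketch of how the same check can be run with Flair; the model identifiers and span-label accessors below reflect common Flair usage, but the exact API may vary between Flair versions:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# "flair/ner-english" is the CONLL-03 based model;
# "flair/ner-english-ontonotes" is the OntoNotes based one.
tagger = SequenceTagger.load("flair/ner-english")

def flair_contains_person(tagger, text: str) -> bool:
    """Return True if the tagger detects a person entity in the text."""
    sentence = Sentence(text)
    tagger.predict(sentence)
    for span in sentence.get_spans("ner"):
        # CONLL-03 models tag persons as "PER", OntoNotes models as "PERSON".
        # Depending on the Flair version, the tag may also be read via
        # span.get_label("ner").value instead of span.tag.
        if span.tag in ("PER", "PERSON"):
            return True
    return False
```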
CONLL-03 based model results:
Model name: ner-english (CONLL)
Name set: Biblical, Template: "My name is {}"
Recall: 1.00
Name set: Other, Template: "My name is {}"
Recall: 1.00
Name set: Biblical, Template: "And {} said, Why hast thou troubled us?"
Recall: 1.00
Name set: Other, Template: "And {} said, Why hast thou troubled us?"
Recall: 1.00
Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 1.00
Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.94
OntoNotes based model results:
Model name: ner-english-ontonotes
Name set: Biblical, Template: "My name is {}"
Recall: 0.50
Name set: Other, Template: "My name is {}"
Recall: 1.00
Name set: Biblical, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.00
Name set: Other, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.83
Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.00
Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.00
Conclusion
spaCy is one of the most exciting things happening in NLP today, and it’s considered one of the most mature, accurate, fast and well-documented NLP libraries in the world. As the Flair example shows, this is not a spaCy-specific problem but an inherent problem of ML models and, especially, of ML datasets.
Three relevant pointers to conclude:
- Andrew Ng recently argued that the ML community should be more data-centric and less model-centric. This post is another example of why this is true.
- This is another example of an issue with a major ML dataset.
- A tool like Checklist is really helpful to validate that your model or data doesn’t suffer from similar issues. Make sure you check it out.
A Jupyter notebook for this blog post can be found here.
About the author: Omri Mendels is a Principal Data Scientist at Microsoft.