
How to Cover Up in Multiple Languages

Applying a polyglot Named Entity Recognition model for anonymization


Photo by Thirdman from Pexels

In my previous post, Unlocking Inclusivity Possibilities with Polyglot-NER, I introduced polyglot Named Entity Recognition (NER) and its application to anonymizing documents. In part two of this series, I will:

  • Provide a recap of the anonymization problem
  • Discuss two models introduced in my previous post
  • Show different methods of NER evaluation and their drawbacks
  • Present experimental results of applying a polyglot NER model for anonymization of Brazilian Portuguese legal texts

The growing number of pre-trained polyglot models available online is exciting. It can lead one to think that for any task and any language it is easy to find a model that suits your needs. But this is not always the case! We will examine this in this blog post by taking a pre-trained polyglot model and comparing it to a pre-trained monolingual model in the legal domain.


Problem

Like most NLP applications, NER models are predominantly trained on English. Polyglot models, on the other hand, work on multiple languages. These models are sometimes called cross-domain models, where 'domain' refers to the language. But the word 'domain' can also have a different meaning in NLP: it can refer to a type of text. In NLP we not only have many languages, we also have many domains.

The experiment performed in this blog post measures how well a polyglot model can switch to these other types of domains. We are taking a polyglot model trained on a general domain and applying it to a legal domain.


The models: monolingual and polyglot

In my previous post I introduced two models:

  • A monolingual model: LENER-BR [1]
  • A polyglot model: XLM-R-NER40 [2]

The monolingual model was selected because it comes with a Brazilian Portuguese dataset, a language that is uncommon in NLP research. Exactly what we were looking for! The polyglot model was chosen because:

  • It can handle Portuguese (and 39 other languages)
  • The implementation is on Huggingface [3]
  • Performance-wise it is one of the best polyglot models

Because the model is on Huggingface it is very easy to apply. In principle, it is as easy as:

from transformers import pipeline

nlp_ner = pipeline(
    "ner",
    model="jplu/tf-xlm-r-ner-40-lang",
    tokenizer=(
        'jplu/tf-xlm-r-ner-40-lang',  
        {"use_fast": True}),
    framework="tf"
)

And now you are ready to perform NER in 40 languages!
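As a quick sanity check, the pipeline can be applied directly to a Portuguese sentence. The sentence below is my own invention, and the exact output fields depend on your transformers version:

example = "O Superior Tribunal Militar fica em Brasília."
for prediction in nlp_ner(example):
    # Each prediction is a dict containing (roughly) the predicted tag,
    # the (sub)token text, and a confidence score.
    print(prediction)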

XLM-R-NER-40 was already tested on Portuguese in the general domain. So, we do not want to do that again! Instead, we are testing it in the legal domain. Plu reports the following scores [3]:

Scores on Portuguese reported by Plu [3]

We are applying the polyglot model to the dataset of the monolingual model, which is a different dataset from the one Plu's scores were calculated on [3].


Evaluation

NER models predict entities in a text, and entities can be privacy sensitive. If we want to redact these sensitive entities, we can use NER models to find them.

When anonymizing a document, without regard for entity types, there are three potential outcomes:

  • The model over-redacted, meaning we redacted too many tokens
  • The model under-redacted, meaning we redacted too few tokens
  • The model redacted perfectly 😊

A good evaluation method can quantify the performance of a model for any of the three outcomes. Before we go any further, we need to talk about the IOB (Inside-Outside-Beginning) format, which is the most commonly used format for annotating named entities.

Often, entities are spread over multiple tokens. A token is a single word or, in the case of punctuation, a single character. Let us look at an entity from the dataset: "Superior Tribunal Militar".

IOB format of example, image by author

In the image above we see the original labelling and the transformed labelling for the first example entity. The top labels are the original labels; for the second entity the transformation is identical, except that the starting labels are different.

There are two different entity types in this example: an organization and a person. However, in our experiments we use a binary approach: we are not interested in the entity type, only in whether an entity occurred. We decided on this approach because the original dataset has more classes than the XLM-R-NER40 model was trained on.
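As an illustration, this binarization can be done with a small helper like the one below. This is only a sketch: the class-specific tag names (e.g. B-ORGANIZACAO, I-PESSOA) are illustrative of the LENER-BR annotation scheme.

def binarize(tag):
    # Map any class-specific entity tag (e.g. "B-ORGANIZACAO", "I-PESSOA")
    # to a binary B-ENT / I-ENT tag; non-entity tokens stay "O".
    if tag == "O":
        return "O"
    prefix, _ = tag.split("-", 1)  # keep the B/I prefix, drop the class
    return prefix + "-ENT"

print([binarize(t) for t in ["B-ORGANIZACAO", "I-ORGANIZACAO", "O"]])
# ['B-ENT', 'I-ENT', 'O']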

There are three possible outcomes when we want to classify a single NE:

  • The model did not predict any of the tokens to be entities
  • The model predicted the beginning and ending of the entity correctly
  • The model predicted part of the tokens correctly but mismatched the beginning and/or ending

Let us look at a fictional sentence so we can consider some of the intricacies in using NER for anonymization.

Consider the fictional sentence:

"The FBI discovered that Bill Clinton stole a briefcase at the Marriott in Chicago."

When we use our model to anonymize this sentence, the result could look like:

"The X discovered that X Clinton stole a briefcase at the X in X."

All right, we correctly predicted 4 out of the 5 entity tokens, but how useful is our system? Did we anonymize this document completely? Is our model 80% effective? This is a tricky question to answer.

Not every entity carries the same weight, as the example above makes clear. If, instead of 'Clinton', the token 'Chicago' had been misclassified, I would argue the sentence would be better anonymized. The fact that certain entities are more sensitive than others is not captured in any current NER evaluation. What can be considered is the number of entities a model completely predicted, as we shall see later in this post.
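To make the redaction step behind the fictional example concrete, here is a minimal token-level sketch. The tags are my own illustrative predictions in the binarized IOB scheme:

def redact(tokens, tags):
    # Replace every token that the model marked as part of an entity with "X".
    return " ".join("X" if tag != "O" else tok for tok, tag in zip(tokens, tags))

tokens = ["The", "FBI", "discovered", "that", "Bill", "Clinton", "stole", "a", "briefcase"]
tags = ["O", "B-ENT", "O", "O", "B-ENT", "O", "O", "O", "O"]
print(redact(tokens, tags))
# The X discovered that X Clinton stole a briefcase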


The data used for our experiments comes from the LENER-BR dataset and can be found in its repository. I used three documents from the test set.
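For reference, the LENER-BR files follow a CoNLL-style layout, with one token and its label per line and blank lines separating sentences, so a small parser along these lines is enough to read a document (a sketch; the file name is illustrative):

def read_conll(path):
    # Parse a CoNLL-style file: one "token label" pair per line,
    # blank lines separate sentences.
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

sentences = read_conll("ACORDAOTCU11602016.conll")  # illustrative file name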

First let’s look at the accuracy of the models without considering IOB.

Suppose our document consists of 1000 tokens. Each token has its own label, which leaves us with 1000 labels in total. The possible labels are 'ENT' and 'O'. We evaluate the model by comparing each prediction with its corresponding label and counting how often they match; this proportion is the accuracy.
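In code, this token-level accuracy is just the fraction of matching labels (a minimal sketch):

def accuracy(gold, predicted):
    # Fraction of tokens whose predicted label matches the gold label.
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)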

When evaluating the two NER models on one of the test documents (ACORDAOTCU11602016), we get the following accuracy scores:

  • LENER-BR: 0.991
  • XLM-R-NER: 0.866

At first glance, we see that LENER-BR performs a lot better than XLM-R. But does this mean that LENER-BR is 99% successful in anonymizing a document and that XLM-R is only 87% successful?

To answer this question, we first look at a baseline model. Assume that we have a model that always predicts 'O' as the label. Admittedly, this is a foolish model and I strongly discourage anyone from using it. However, it does give us insight into the added value of trained models. For this model and test document, the accuracy score is:

  • Majority baseline: 0.857
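This baseline needs no model at all and can be simulated directly with the helpers sketched above (the gold labels are flattened from the parsed test document and binarized as before):

gold = [binarize(label) for sentence in sentences for _, label in sentence]
baseline_predictions = ["O"] * len(gold)     # always predict the majority class
print(accuracy(gold, baseline_predictions))  # equals the fraction of 'O' tokens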

These results are not surprising, since most words in a text are not NEs. In some documents, only 1% of tokens are NEs, so getting 99% accuracy on such a document is extremely easy, but not useful. In our experiment, accuracy is not a useful metric for evaluating NER. We need to look at the usual suspects: precision, recall, and F1, because these metrics give more information about performance. For example, it is possible to have very high accuracy but score poorly on recall and precision.

We can use the classification_report function from sklearn [4] to obtain the precision, recall, and F1 scores. Let's look at the report for document ACORDAOTCU11602016:

Classification reports for the two models, image by author
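For completeness, a token-level report like this can be produced with a single call; here gold is the flattened list of gold labels from above, and predicted is assumed to be the model's token-level output aligned with it:

from sklearn.metrics import classification_report

print(classification_report(gold, predicted, digits=3))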

Precision tells us how often a token the model labelled as X actually was X. Recall tells us how many of the actual X tokens the system 'caught'. While this report is not perfect for evaluating NER, it is more informative than the accuracy score. For example, the difference in the performance of XLM-R-NER40 on B-ENT and I-ENT tells us that this model has more difficulty recognizing the beginning of an entity than the other tokens that belong to an entity: it only catches the beginning of an entity 34% of the time.

All metrics discussed up to this point are applied to individual tokens, rather than to multi-token entities. For proper anonymization, the ability to identify complete multi-token entities (e.g., 'Bill Clinton') is of utmost importance. Luckily, the seqeval [5] package can measure this for us. It produces a classification report, just like sklearn, but it is a lot stricter, which also has its downsides. Let's look at the following example consisting of two short sentences:

gold_label = [
    ['B-ENT', 'O', 'O', 'O', 'O'],                    # sentence (1)
    ['O', 'O', 'B-ENT', 'I-ENT', 'I-ENT', 'O', 'O'],  # sentence (2)
]
predictions = [
    ['B-ENT', 'I-ENT', 'O', 'O', 'O'],                # (1): entity over-predicted
    ['O', 'O', 'B-ENT', 'I-ENT', 'O', 'O', 'O'],      # (2): entity under-predicted
]
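A seqeval report for these two toy sentences can be produced as follows (a minimal sketch; note that seqeval expects per-sentence lists of tags rather than one flat list):

from seqeval.metrics import classification_report as seqeval_report

print(seqeval_report(gold_label, predictions))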

We have two sentences, and each contains one entity. In sentence (1) the entity is 'over-predicted', while in (2) it is 'under-predicted'. When we apply seqeval to these examples, we get:

seqeval scores for two examples, image by author

Ouch, according to the report we did just as badly as a model predicting only 'O'. As mentioned earlier, seqeval measures how many times we predicted the entire multi-token entity correctly. We can already see some of the drawbacks of using it for our experiments, as it is not entirely fair to say that predicting only 'O' performs just as well as the predictions above.


Results

Given the drawbacks of each method, I am going to show both the seqeval and the sklearn reports. To get the main point, you can skip right to results table 4. In table 2 we do not need to show the macro/weighted average because it is equivalent to the precision, recall, and f1-score of the single class.

Results table 1: sklearn report for XLM-R-NER40, image by author
Results table 2: sklearn report for LENER-BR, image by author
Results table 3: seqeval report for both models, image by author
Results table 4: Mean f1-scores for both models, image by author

Both types of classification reports clearly show that the monolingual model performs a lot better, which is not surprising given that it was specifically developed for this task. If we look at results table 4, we see that with the sklearn evaluation method the monolingual model performs about twice as well as the polyglot model, while with the seqeval method it performs more than 4.5 times as well. The results show the limits of the XLM-R-NER40 model on legal texts. This does not mean it is a bad model, but it is important to realize there is a limit to what we can do with out-of-the-box models.


Summary & conclusion

To summarize, in this post we have seen:

  • The impact of a domain switch
  • How, with relative ease, we can find polyglot and monolingual models online
  • How we can apply a general model to a more specific domain with more classes through binarization
  • The difficulties in evaluating NER, especially in the context of anonymization

We saw the impact of the domain switch because the scores reported on the original general domain were a lot higher than the scores on the legal domain. The two models used for this experiment were easy to find and apply because one of them is on Huggingface and the other can be found on GitHub. We must conclude that there are limits to taking a polyglot model out of the box and applying it to any task. This shows two things: we still need ML engineering expertise, and we need to keep working on models that can handle any domain, in any language. That way anyone can make use of these models with minimal ML engineering knowledge.

For future work it would be interesting to design evaluation methods that are specific to anonymization. Such a method would have to capture that recall is more important than precision, while still considering that over-anonymization is undesirable. For anyone seriously interested in using XLM-R-NER40 for the legal domain, it would also be interesting to see what happens to the results if the model is fine-tuned on legal texts.

If you’d like to learn more about Slimmer AI and the latest research from our AI Innovation team Fellowship, visit our website or our AI Innovation page on Medium.


References

[1] Luz de Araujo, P. H., de Campos, T. E., de Oliveira, R. R. R., Stauffer, M., Couto, S., & Bermejo, P. (2018). LeNER-Br: A Dataset for Named Entity Recognition in Brazilian Legal Text. Computational Processing of the Portuguese Language. Springer. doi: 10.1007/978-3-319-99722-3_32

[2] Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., …Stoyanov, V. (2019). Unsupervised Cross-lingual Representation Learning at Scale. arXiv, 1911.02116. Retrieved from https://arxiv.org/abs/1911.02116v2

[3] jplu/tf-xlm-r-ner-40-lang · Hugging Face. (2021, June 22). Retrieved from https://huggingface.co/jplu/tf-xlm-r-ner-40-lang

[4] sklearn.metrics.classification_report. (2021, October 21). Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

[5] Hiroki Nakayama. (2018). seqeval: A Python framework for sequence labeling evaluation.

