Remove personal information from a text with Python – Part II – NER

Implementation of a privacy filter in Python that removes Personally Identifiable Information (PII) with Named Entity Recognition (NER).

Photo from Michael Dziedciz on Unsplash

This is a follow-up on my previous article on the removal of personal information from texts.

The GDPR is the General Data Protection Regulation of the European Union. Its purpose is to protect the data of all European residents. Protecting data should also be an intrinsic value for a developer. Protecting data in a row/column data structure is relatively easy by controlling access to columns and rows. But what about free text?

In my previous article I described a solution based on regular expressions and a list of forbidden words. In this article we add an implementation based on Named Entity Recognition (NER). The full implementation can be found in the GitHub PrivacyFilter project.

What is Named Entity Recognition?

According to Wikipedia, NER is:

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

So it is all about finding and identifying entities in texts. An entity can be a single word or a series of consecutive words. An entity is classified into a pre-defined category. For example, in the sentence below, three entities are found: the person entity "Sebastian Thrun", the organisation entity "Google" and the date entity "2007".

Example entity recognition (source: Spacy.io)

NER is a subtask within Natural Language Processing (NLP), the field of artificial intelligence concerned with algorithms that process and analyse natural languages. Since NER can identify entities in natural language, those entities can be removed from a text when they are privacy related, such as a person, organisation, date or location.

Filter PII with NER

First we need an NLP package. NLP packages are trained per language since every language has its own grammar. We are working with Dutch, so we need one that understands Dutch. We will be using Spacy for our privacy filter.

On the Spacy website a tool can be found that helps with installing Spacy. After selecting your Python environment and language, it gives the appropriate commands to install Spacy:

Spacy install tool (source: Spacy.io)

The selected pipeline (efficiency or accuracy) determines the accuracy of the NER model versus the size and speed. Selecting ‘efficiency’ results in a smaller and faster model but with lower accuracy compared to ‘accuracy’. It depends on your use case which model is more appropriate. For development we choose to use the efficiency model. Running a first NER analysis:

After importing the Spacy package, a model is loaded using the spacy.load() method. In this case the efficient model for Dutch is loaded. A model is specified by its name, which is identical to the name used to download the model in the previous step. To switch to the accurate Dutch language model, replace "nl_core_news_sm" with "nl_core_news_lg". For the example sentence this results in the same output.
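The steps described above can be sketched as follows. This is a minimal sketch, not the original listing: the helper names load_dutch_model and analyse are hypothetical, and it assumes Spacy and the Dutch model are installed when load_dutch_model is called.

```python
def load_dutch_model(name="nl_core_news_sm"):
    """Load a Spacy pipeline by name; Spacy and the model must be installed."""
    import spacy  # imported here so analyse() can be tried without Spacy installed
    return spacy.load(name)


def analyse(nlp, text):
    """Return (text, POS tag, entity type) for every token in `text`."""
    doc = nlp(text)
    return [(token.text, token.pos_, token.ent_type_) for token in doc]


# Typical use, once the Dutch model has been downloaded:
#   nlp = load_dutch_model()            # or load_dutch_model("nl_core_news_lg")
#   for row in analyse(nlp, "Geert werkt sinds 2010 voor HAL."):
#       print(*row)
```

Keeping the model name as a parameter makes it easy to switch between the efficient and the accurate model without touching the rest of the code.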

A quick, simple performance test shows that loading the small model takes ~2.0 seconds and loading the large model takes ~4.5 seconds. Analysing a sentence takes ~5.5 milliseconds with the small model versus ~6.0 milliseconds with the large model. The large model appears to use approximately 500 MB more memory.
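A timing test of this kind could be set up along the following lines. This is only a sketch of one possible measurement, not the test that produced the numbers above; the helper name average_seconds is hypothetical.

```python
import time


def average_seconds(func, repeat=100):
    """Call func() `repeat` times and return the mean wall-clock time per call."""
    start = time.perf_counter()
    for _ in range(repeat):
        func()
    return (time.perf_counter() - start) / repeat


# Example with Spacy (requires Spacy and the Dutch model):
#   import spacy
#   load_time = average_seconds(lambda: spacy.load("nl_core_news_sm"), repeat=1)
#   nlp = spacy.load("nl_core_news_sm")
#   parse_time = average_seconds(lambda: nlp("Geert werkt sinds 2010 voor HAL."))
```

Results will vary with hardware and Spacy version, so it is worth repeating the measurement on your own machine and texts.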

The meaning of the Part of Speech (POS) tags can be found on this site. For our example, they are:

Geert PROPN PERSON     Proper noun, person
werkt VERB             Verb
sinds ADP              Adposition, case marking
2010  NUM DATE         Numeral, date
voor  ADP              Adposition
HAL   PROPN ORG        Proper noun, organisation
.     PUNCT            Punctuation

For filtering PII we are interested in the POS types NUM and PROPN. We will replace these tokens with a tag describing their entity type.

The first part of the code loads the language model and parses the input string into a list of tokens (doc). The loop then builds the filtered text by iterating over all tokens in the document. If a token is of type PROPN, NOUN or NUM, it is replaced with a tag <…>, where the tag is equal to the entity type recognized by Spacy. All tokens are concatenated to the new string with a prefix space. The prefix is required since tokenizing the string has removed the spaces. For a punctuation symbol, no prefix space is added.

After the loop, the first character of the new string is the prefix space of the first token, so we need to remove it. This results in the string without privacy information.
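The loop described above can be sketched without Spacy by feeding it the (text, POS tag, entity type) triples from the table earlier in the article. This is a minimal sketch, not the original listing: the names TOKENS, PII_POS and filter_tokens are hypothetical, and skipping tokens that have no entity type is an added assumption.

```python
# (text, POS tag, entity type) per token, copied from the table earlier in the article.
TOKENS = [
    ("Geert", "PROPN", "PERSON"),
    ("werkt", "VERB", ""),
    ("sinds", "ADP", ""),
    ("2010", "NUM", "DATE"),
    ("voor", "ADP", ""),
    ("HAL", "PROPN", "ORG"),
    (".", "PUNCT", ""),
]

PII_POS = {"PROPN", "NOUN", "NUM"}


def filter_tokens(tokens):
    """Replace recognised entities with <ENTITY_TYPE> tags and rebuild the sentence."""
    filtered = ""
    for text, pos, ent_type in tokens:
        if pos in PII_POS and ent_type:
            # Assumption: tokens without an entity type are kept unchanged.
            filtered += " <{}>".format(ent_type)
        elif pos == "PUNCT":
            filtered += text  # no prefix space before punctuation
        else:
            filtered += " " + text
    # Drop the prefix space that was added before the first token.
    return filtered[1:] if filtered.startswith(" ") else filtered


print(filter_tokens(TOKENS))  # <PERSON> werkt sinds <DATE> voor <ORG>.
```

With a loaded Spacy model, the same function can be fed with a generator such as ((t.text, t.pos_, t.ent_type_) for t in nlp(text)).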

How good is it?

In the previous article we built a privacy filter based on a forbidden word list. That approach requires more code and effort compared to NER. But how do they compare?

  • NER requires grammatically correct sentences. In that case, replacement of privacy information works well, even if names are misspelled. Here NER is superior to the forbidden word list.
  • The forbidden word filter will replace forbidden words, no matter what their context is. Especially the lists of street names and city names result in a lot of unnecessarily deleted words. E.g., plant names, animal names or objects like Castle are common as street names and will be removed from the text. This might remove a lot of words unnecessarily, reducing the usability of the resulting text. NER will perform better here.
  • If the text is not grammatically correct (e.g. the answer ‘Peter’ to the question ‘What is your name?’), it will not be filtered correctly by NER. Such fragments are common in chat messages and transcripts of conversations. The NER approach fails in these cases since the algorithm cannot determine the nature of answers that consist of one or a few words.

So it all depends on your use case and required level of filtering. This combination determines if the best approach is to use the forbidden list version, the NER version or even a combination of the two. The latter will combine the advantages of both approaches (but also part of their weaknesses). To find the best approach, take a subset of your data to filter and test different algorithms and/or combinations to find the best fitting one.
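A combination of the two approaches could, for instance, replace a token as soon as either approach flags it. The sketch below illustrates that idea on hand-written token triples; it is not the combination logic of the PrivacyFilter project. The tiny FORBIDDEN set, the function names and the POS tags in CHAT are assumptions made for illustration.

```python
FORBIDDEN = {"geert", "duiven"}  # hypothetical stand-in for the real word lists

# Assumed NER output for a chat-like answer: no entity type is assigned to "Geert".
CHAT = [
    ("Wat", "PRON", ""), ("is", "AUX", ""), ("je", "PRON", ""),
    ("naam", "NOUN", ""), ("?", "PUNCT", ""),
    ("Geert", "PROPN", ""), (".", "PUNCT", ""),
]


def fwl_hits(tokens, forbidden=FORBIDDEN):
    """Indices of tokens whose lower-cased text is on the forbidden word list."""
    return {i for i, (text, _pos, _ent) in enumerate(tokens) if text.lower() in forbidden}


def ner_hits(tokens):
    """Indices of tokens to which the NER model assigned an entity type."""
    return {i for i, (_text, _pos, ent) in enumerate(tokens) if ent}


def combined_filter(tokens, forbidden=FORBIDDEN):
    """Replace a token with <FILTERED> when either approach flags it."""
    flagged = fwl_hits(tokens, forbidden) | ner_hits(tokens)
    out = ""
    for i, (text, pos, _ent) in enumerate(tokens):
        word = "<FILTERED>" if i in flagged else text
        out += word if pos == "PUNCT" else " " + word
    return out.lstrip()


print(combined_filter(CHAT))  # Wat is je naam? <FILTERED>.
```

Taking the union catches the name that NER misses in the chat-like answer, but it also inherits the over-filtering of the word list: a sentence containing ‘duiven’ would lose that word as well.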

Some examples to compare NER with the forbidden word list (FWL):

INPUT: Geert werkt sinds 2010 voor HAL.
NER  : <FILTERED> werkt sinds <FILTERED> voor <FILTERED>.
FWL  : <FILTERED> werkt sinds <FILTERED> voor HAL.
INPUT: Heert werkt sinds 2010 voor HAL.
NER  : <FILTERED> werkt sinds <FILTERED> voor <FILTERED>.
FWL  : Heert werkt sinds <FILTERED> voor HAL.
INPUT: Wat is je naam? Geert.
NER  : Wat is je naam? Geert.
FWL  : Wat is je naam? <FILTERED>.
INPUT: Geert kijkt naar de duiven op het dak.
NER  : <FILTERED> kijkt naar de duiven op het dak.
FWL  : <FILTERED> kijkt naar de <FILTERED> op het dak.

(all entity-specific tags are replaced with the generic tag <FILTERED> for ease of comparison)

The first example shows that FWL cannot remove company names since it has no list of company names. The NER algorithm has determined from the sentence that ‘HAL’ is a noun and, more specifically, an organisation.

The second example shows that NER can handle a typing error in the name since it looks at the structure of the sentence, while FWL does not recognize ‘Heert’ as a name. The list of names only contains the correctly spelled versions.

The third example shows that NER needs grammatically correct sentences to identify ‘Geert’ as a name. This could be the transcript of a conversation or the interaction in a chat. It shows that NER works well with written language but has trouble understanding spoken-like language.

And in the last example FWL removes the word ‘duiven’ since it does not only describe the animal (duiven is Dutch for pigeons) but is also the name of a city.

The privacy filter code on GitHub contains both approaches, and during initialisation it is possible to choose the NER approach or the FWL approach. We did not touch the regular expressions in this article, but selecting the NER approach will also execute the regular expressions (NER does not recognize and replace URLs etc.). The repository also contains some example texts to filter, so you can see the differences between both approaches in real-life use cases.
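To illustrate why a regular-expression pass remains useful next to NER, the sketch below replaces URLs and e-mail addresses with tags. The patterns are deliberately simple and hypothetical; they are not the expressions used in the PrivacyFilter project, and the function name regex_filter is made up for this example.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def regex_filter(text):
    """Replace e-mail addresses and URLs with tags."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = URL_RE.sub("<URL>", text)
    return text


print(regex_filter("Mail geert@hal.nl of kijk op https://www.hal.nl/contact"))
# Mail <EMAIL> of kijk op <URL>
```

Note that the URL pattern consumes everything up to the next whitespace, so punctuation directly after a URL is replaced along with it. A production filter would need more careful patterns.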

Final words

This article and the previous one describe two approaches to removing personal information from text. Both approaches have their strengths and weaknesses, and it is not possible to choose one approach for all use cases. Removing more privacy information also results in removing more non-privacy information, thereby reducing the value of the filtered text. NER is more accurate in removing identified privacy information but requires well-formed sentences to operate. For maximum security it is even possible to combine both approaches. Feel free to experiment with the implementation on GitHub.

I hope you enjoyed this article.

Disclaimer: The views and opinions included in this article belong only to the author.

