
The GDPR is the General Data Protection Regulation of the European Union. Its purpose is to protect the personal data of all EU residents. Protecting data is also an intrinsic value of a developer. Protecting data in a row/column data structure is relatively easy: you control access to columns and rows. But what about free text?
To fulfil our privacy requirements, we can adapt the content of a free-text field and replace privacy-related information with tags. The meaning of the text is not altered, but because of this anonymization it can no longer be related to an individual. The goal is to take the following text (the date formats are Dutch):
The possibilities have increased since 2014, especially compared to 2012, hè Kees? The system has different functions to manipulate data. The date is 12-01-2021 (or 12 jan 2021 or 12 januari 2021). You can reach me at [email protected] and I live in Rotterdam. My address is Maasstraat 13, 1234AB. My name is Thomas de Vries and I have Acne. Oh, I use ranitidine for this.
and replace it with
The possibilities have increased since , especially compared to, hè ? The system has different functions to manipulate data. The date is (or or ). You can reach me at and I live in . My address is , . My name is and I have . Oh , I use for this.
This article describes a simple privacy filter that replaces each of the following with a dedicated tag:
- Dates
- URLs
- Email addresses
- Postal codes
- Numbers
- Cities and regions
- Street names
- First and last names
- Diseases
- Medicine names
The last two are added since medical information requires extra care. The number of occurrences will be low but the impact is big when this information is leaked.
The first five actions are performed with regular expressions, while the last five are implemented with a keyword replacement function. Our privacy filter class has the following structure:
The class PrivacyFilter implements the different filters. After creation and initialisation, the object can be used to filter text. It works with regular expressions and the FlashText KeywordProcessor.
Filtering with regular expressions
The first five filters are implemented with regular expressions. Replacing numbers is the first and simplest replacement:
This regular expression replaces every word that contains one or more digits with the number tag. This will replace bank account numbers, phone numbers, ID numbers, etcetera in the text. This filter is executed last, so that postal codes and dates are replaced by their appropriate tags instead of by a series of number tags.
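The number replacement described above can be sketched in a few lines. This is a minimal sketch and not the original listing: the function name remove_numbers and the tag <NUMBER> are assumptions made for illustration.

```python
import re

NUMBER_TAG = "<NUMBER>"  # assumed placeholder; use the tag your filter defines

# A "word" with at least one digit: optional word characters,
# one digit, then optional word characters again.
NUMBER_PATTERN = re.compile(r"\w*\d\w*")


def remove_numbers(text):
    """Replace every word that contains a digit with the number tag."""
    return NUMBER_PATTERN.sub(NUMBER_TAG, text)
```

Because \w also matches letters, a mixed identifier such as AB12 is replaced as a whole. That is also why this step would swallow postal codes if it ran before the postal code filter.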
A bit more advanced is the function to remove postal codes. Postal codes in the Netherlands have the form 0000AA with an optional space between the numbers and the letters. To replace these the following regular expression is used:
The trailing part that matches a punctuation mark prevents a sequence of four digits followed by the first two letters of a word from being replaced. For example, in 'order of 4000 items' we do not want '4000 it' to be treated as a postal code, which would leave only 'ems' of the word 'items'.
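A possible version of this postal code filter is sketched below. It is an assumption-based sketch: it uses a lookahead instead of capturing the punctuation mark, it accepts lowercase letters as well, and the tag <POSTCODE> and the function name are placeholders.

```python
import re

POSTCODE_TAG = "<POSTCODE>"  # assumed placeholder tag

# Four digits, an optional space, two letters, and then a lookahead that
# requires whitespace, a punctuation mark or the end of the text.
POSTCODE_PATTERN = re.compile(r"\b\d{4} ?[A-Za-z]{2}(?=[\s,.:;!?]|$)")


def remove_postal_codes(text):
    """Replace Dutch postal codes (0000AA or 0000 AA) with the postal code tag."""
    return POSTCODE_PATTERN.sub(POSTCODE_TAG, text)
```

The lookahead checks the character after the two letters without consuming it, so the punctuation mark stays in the text and does not have to be put back by the replacement.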
Removing email addresses is a bit trickier because of their more complex structure:
The regular expression comes from the website "Email regular expression that 99.99% works", which also shows implementations of an email checker in various languages. Another good source for regular expressions is Murani.nl.
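To illustrate the idea, here is a deliberately simplified email filter. The pattern below is not the one from the website mentioned above; it is a shorter assumption that covers common addresses only, and the tag <EMAIL> is a placeholder.

```python
import re

EMAIL_TAG = "<EMAIL>"  # assumed placeholder tag

# Simplified: a local part, "@", and a domain that contains at least one dot.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def remove_email(text):
    """Replace email addresses with the email tag."""
    return EMAIL_PATTERN.sub(EMAIL_TAG, text)
```

Each domain label must contain at least one character, so a full stop that ends the sentence is not treated as part of the address.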
The removal of dates is not possible with one regular expression, since months can be written as numbers, as abbreviations or as full names. To remove dates we need three regular expressions:
The first regular expression matches dates written as numbers in the form dd-mm-yyyy. Different separators between the date parts are supported. The second and third match dates with the name of the month in text.
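The three expressions could look like the sketch below. The patterns, the Dutch month lists, the function name and the tag <DATE> are assumptions for illustration, not the original listing.

```python
import re

DATE_TAG = "<DATE>"  # assumed placeholder tag

# dd-mm-yyyy with "-", "/" or "." as separator; \1 forces the same separator twice.
NUMERIC_DATE = re.compile(r"\b\d{1,2}([-/.])\d{1,2}\1\d{2,4}\b")

# Dutch month abbreviations, for example "12 jan 2021".
ABBREVIATED_MONTH = re.compile(
    r"\b\d{1,2} (?:jan|feb|mrt|apr|mei|jun|jul|aug|sep|okt|nov|dec)\.? \d{4}\b",
    re.IGNORECASE,
)

# Full Dutch month names, for example "12 januari 2021".
FULL_MONTH = re.compile(
    r"\b\d{1,2} (?:januari|februari|maart|april|mei|juni|juli|augustus"
    r"|september|oktober|november|december) \d{4}\b",
    re.IGNORECASE,
)


def remove_dates(text):
    """Apply the three date patterns one after the other."""
    for pattern in (NUMERIC_DATE, FULL_MONTH, ABBREVIATED_MONTH):
        text = pattern.sub(DATE_TAG, text)
    return text
```

The numeric pattern requires at least two digits for the year, so a version string such as 1.2.3 is left alone.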
Filtering with KeywordProcessor
Filtering on places, streets, names, medicines and diseases requires thousands of regular expressions if built like the previous set of replacements. Even combining series of names in one regular expression is expensive.
To solve this problem, Alfred V. Aho and Margaret J. Corasick developed the Aho-Corasick algorithm, which locates strings stored in a dictionary-like structure. A graph is created from all search terms and this graph is traversed while parsing the text.

This graph contains the strings "AB", "ABEF", "AC" and "BD", as only the blue nodes are end nodes. When the first letters are "AB" it is an end node unless the letters "C" and "E" follow. For use in the KeywordProcessor the replacement tags are associated with the end nodes in the graph. This way, all different privacy elements can be added to one graph and still be replaced by the appropriate tag.
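The idea of end nodes that carry a replacement can be sketched with a plain trie built from nested dictionaries. This is a simplified illustration under stated assumptions: it is not the FlashText implementation, it has no failure links as in the full Aho-Corasick algorithm, it ignores word boundaries, and the tags <1> to <4> are made up.

```python
END = "_end_"  # marker key that stores the replacement tag at an end node


def build_trie(keywords):
    """Build a nested-dict trie from a {keyword: tag} mapping."""
    root = {}
    for word, tag in keywords.items():
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[END] = tag
    return root


def longest_match(trie, text, start):
    """Return (end_index, tag) of the longest keyword at `start`, or None."""
    node = trie
    best = None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            best = (i + 1, node[END])
    return best


def replace_keywords(trie, text):
    """Walk the text once and replace every keyword by its tag."""
    out = []
    i = 0
    while i < len(text):
        match = longest_match(trie, text, i)
        if match:
            i, tag = match
            out.append(tag)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


trie = build_trie({"AB": "<1>", "ABEF": "<2>", "AC": "<3>", "BD": "<4>"})
```

With this trie, "AB" is replaced on its own, but when "EF" follows the longer keyword "ABEF" wins. The cost of a lookup depends on the length of the text, not on the number of keywords in the trie.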
There are several implementations of this algorithm available; here we use the FlashText implementation from GitHub. The algorithm is described in "Replace and Retrieve Keywords in Documents at Scale". FlashText contains a KeywordProcessor to which keywords are added together with their replacement: keyword_processor.add_keyword('keyword', 'replacement'). The end nodes store the replacement to put in place.
In the dataset folder there are several files with one keyword per line, for example a file with all first names, or at least the 10,000 most common ones. We can add all the elements in this file to the graph with the replacement tag as follows:
In the constructor a case-sensitive KeywordProcessor is created. We use a case-sensitive processor because several names are also verbs in the Dutch language. This way we only replace them when they start with a capital letter, as in the input file. If you want to be more secure, you can use a case-insensitive processor.
The input file is read into a list, duplicates are removed from this list and the list is filtered on a minimum length. Each item in the list is then added to the processor (and thus the graph) with the appropriate tag.
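The loading steps described above can be sketched as two small helpers. The names load_keywords and add_keywords, their parameters and the tag in the usage comment are assumptions. The processor argument is any object with an add_keyword(keyword, replacement) method, such as a FlashText KeywordProcessor.

```python
def load_keywords(lines, min_length=1, skip_header=False):
    """Clean an iterable of lines (for example an open file) into unique keywords."""
    words = [line.strip() for line in lines]
    if skip_header:
        words = words[1:]  # the first row holds the name of the dataset
    unique = list(dict.fromkeys(words))  # drop duplicates, keep first-seen order
    return [word for word in unique if len(word) >= min_length]


def add_keywords(processor, keywords, tag):
    """Register every keyword with the same replacement tag."""
    for word in keywords:
        processor.add_keyword(word, tag)


# Possible usage with FlashText (hypothetical file name and tag):
#   from flashtext import KeywordProcessor
#   processor = KeywordProcessor(case_sensitive=True)
#   with open("datasets/firstnames.csv", encoding="utf-8") as handle:
#       add_keywords(processor, load_keywords(handle, min_length=3), "<NAME>")
```

Keeping the cleaning separate from the registration makes it easy to use a different minimum length per dataset, for example a stricter one for location names.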
In the initialize function more datafiles can be added for street names, places, last names, medicines etcetera.
Location names are filtered by length because the data is extracted from OpenStreetMap, and the obtained dataset contains empty fields, zero-length fields and short abbreviations. The minimum length can be tailored to your safety requirements.
Filter the text
With all functions in place, we can write the actual filter method:
The regular expression based methods are called, followed by the case sensitive and case insensitive processors. Since the different datasets are integrated in the KeywordProcessors, only one execution is needed. This results in the required output.
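The order of the steps can be sketched as a small pipeline of functions that each take a text and return a text. This is an illustration under assumptions, not the original filter method: the helper names, the simplified patterns and the tags are placeholders.

```python
import re


def regex_step(pattern, tag):
    """Turn a pattern and a tag into a function that filters a text."""
    compiled = re.compile(pattern)
    return lambda text: compiled.sub(tag, text)


def apply_filters(text, steps):
    """Run every step in order; each step takes a text and returns a text."""
    for step in steps:
        text = step(text)
    return text


# Assumed, simplified patterns and tags. The number step is listed last
# among the regular expressions so that it cannot swallow postal codes.
REGEX_STEPS = [
    regex_step(r"\b\d{4} ?[A-Z]{2}\b", "<POSTCODE>"),
    regex_step(r"\w*\d\w*", "<NUMBER>"),
]
```

A FlashText processor fits into the same list through its replace_keywords method, for example REGEX_STEPS + [case_sensitive.replace_keywords, case_insensitive.replace_keywords], where both processor names are hypothetical.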
But what about performance? Replacing parts of a text can become very expensive, especially with this number of forbidden words, in this case approximately 136,000. On my computer, initialization of the class takes 3.1 seconds, but filtering the text presented earlier takes a mere 0.5 ms. That is fast enough for real use cases.
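Timings like these can be reproduced with a small helper around time.perf_counter. The helper name and the repeat count are assumptions, and the numbers will differ per machine.

```python
import time


def average_seconds(func, *args, repeat=1000):
    """Return the average wall-clock time of func(*args) over `repeat` calls."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(*args)
    return (time.perf_counter() - start) / repeat
```

Passing the filter method of an initialised filter object and the example text gives the average time per call. Averaging over many calls matters here because a single call in the sub-millisecond range is too short to measure reliably.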
Final thoughts
This article presents a simple but very effective privacy parser for free text. Improvements are always possible but this code is a best effort approach in filtering privacy information out of a text.
Improvements can be made by replacing the algorithm with a tokenizer. This makes it possible to introduce the Levenshtein distance to measure the distance between words, and thus to also remove words that contain typing errors.
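As an illustration of that idea, the Levenshtein distance can be computed with the classic dynamic programming approach. The helper is_near_match and its threshold are assumptions; this is a sketch of the suggested improvement, not part of the published filter.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def is_near_match(word, keyword, max_distance=1):
    """True when the word is at most max_distance edits away from the keyword."""
    return levenshtein(word, keyword) <= max_distance
```

Comparing every token with every keyword this way is far slower than a single pass through the graph, so a fuzzy check like this would probably only be applied to tokens that were not matched exactly.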
The complete code can be found on GitHub: https://github.com/lmeulen/PrivacyFilter
The tags and example sentence are Dutch, but the source code can easily be adapted to other languages. In the repository there is also a program to collect the different datasets for the Dutch language. Note that this program adds a first row with the name of the data to each data file. The PrivacyFilter class skips this first row when reading the data files.
Disclaimer: The views and opinions included in this article belong only to the author.