Exploring the usage of NEs in a Dutch news data set
Image sources: article by [NOS](https://nos.nl/), photo by Rick L on [Unsplash](https://unsplash.com/photos/ZnBBDPO2mbQ?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) (left); image by author, created in DreamStudio (middle); article by NOS, photo by Cristina Anne Costello on Unsplash (right).
At NOS, the Dutch Public Broadcasting Foundation, our editorial teams write hundreds of news articles every day. These articles inform Dutch citizens about the news, but they also make for an interesting and high-quality data set from a Natural Language Processing point of view. In this blog I, as the Data Scientist at NOS, report on several experiments in which we applied Named Entity Recognition (NER) to our data set of Dutch news articles, and present several ideas on how NER can be applied within the context of news.
What are Named Entities?
A named entity (NE) is a word or phrase that refers to a real-world object with a proper name, for example a person, location or organisation. Models that automatically recognise such words are called Named Entity Recognition (NER) models. An example of such a NER model applied to an excerpt of one of our articles is shown in the figure on the right below, where the NEs are highlighted and annotated with their NE type.
For Dutch, a few pre-trained models are available, such as those from spaCy [1], Flair [2] or NLTK [3]. We performed a qualitative evaluation of these three models by applying them to a random sample of our articles and manually inspecting the results. Based on this, we decided to use spaCy for the remainder of our experiments. An overview of all NE types that this model may recognise is presented in Figure 1 below on the left.
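To make the NER step concrete, here is a minimal sketch of how entities could be pulled out of a text with a spaCy-style pipeline. The helper name `extract_entities` and the example sentence are illustrative assumptions, not our production code; the model name comes from reference [1].

```python
from typing import Callable, List, Tuple

# (surface text, NE type), for example ("Noppert", "PERSON")
Entity = Tuple[str, str]


def extract_entities(nlp: Callable, text: str) -> List[Entity]:
    """Run an NER pipeline on `text` and return (text, label) pairs.

    `nlp` is expected to behave like a loaded spaCy pipeline: calling it
    returns a document whose `.ents` items expose `.text` and `.label_`.
    """
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage with spaCy (assumes `pip install spacy` and
# `python -m spacy download nl_core_news_lg` have been run):
#
#   import spacy
#   nlp = spacy.load("nl_core_news_lg")
#   extract_entities(nlp, "Andries Noppert keept voor sc Heerenveen.")
```

Passing the pipeline in as an argument keeps the helper independent of any specific model, which makes it easy to compare several NER models on the same sample of articles.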

Using the pre-trained model from spaCy, we applied NER to several subsets of our data set. We started by collecting all articles for a single month (February 2023), split the data into the categories news and sport (1,030 and 596 articles respectively), and then applied NER to obtain the total frequency counts per NE type. The results for news and sport are shown in Figure 2 and immediately showcase the significance of NEs in news: in just one month of articles, tens of thousands of NEs are mentioned. To put this in perspective, an article contains 404 words on average, and about 10% of the words in articles are NEs. The figures below also show that the most frequently mentioned NE types differ between news and sport. For news, the most frequent NE type is countries, followed by organisations and persons, while for sport the most frequent NE type is persons, followed by countries and numerals. This might be explained by sport mentioning scores (cardinal) and individual athletes (person), while news covers events for which it is often relevant to mention the location (gpe).
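The frequency counts per NE type described above could be computed along these lines. This is a sketch under assumptions: the entities are taken to be available as `(text, label)` pairs per article, and the toy corpus below is made up purely for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (surface text, NE type)
Entity = Tuple[str, str]


def count_entity_types(articles: Iterable[List[Entity]]) -> Counter:
    """Total number of entity mentions per NE type over all articles."""
    counts: Counter = Counter()
    for entities in articles:
        counts.update(label for _, label in entities)
    return counts


# Hypothetical toy input: one list of entities per article, per category.
corpus = {
    "news": [
        [("Nederland", "GPE"), ("NAVO", "ORG")],
        [("Turkije", "GPE"), ("Rutte", "PERSON")],
    ],
    "sport": [
        [("Noppert", "PERSON"), ("Van Gaal", "PERSON"), ("2", "CARDINAL")],
    ],
}

counts_per_category = {
    category: count_entity_types(articles)
    for category, articles in corpus.items()
}

for category, counts in counts_per_category.items():
    print(category, counts.most_common())
```

Sorting with `most_common()` gives the ordering needed for a bar chart per category.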


NER providing a new point of view on our data
We performed a case study using all articles on the 2022 football World Cup, 482 articles in total. NER was applied to the data set to detect all NEs of the type Person. We found 2,171 unique NEs, of which 1,296 were mentioned just once. In Figure 3A we present an overview of the most frequently mentioned persons during this event. Additionally, for the most frequently mentioned persons, we created a streamgraph to show how the mention frequencies develop over time, as can be seen in Figure 3B. This shows, for instance, that van Gaal is mentioned frequently throughout the whole tournament, while others are mentioned mostly on specific days. These kinds of graphs may provide our editorial teams with new insights, as they are quantitative reflections of what the NOS writes about, and NER makes them cheap to produce. For now we applied this specifically to the 2022 World Cup, but one can come up with many other settings where these kinds of graphs may be interesting: for instance, which politicians or political parties are mentioned during elections or, more generally, the mention frequencies of countries, cities, organisations and so on over a longer time range.
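The numbers behind such an overview and streamgraph boil down to counting mentions per person and per day. The sketch below assumes the Person entities are already available as `(date, name)` pairs; the function names and the toy mentions are illustrative, not taken from our pipeline.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

# (publication date, person name as detected by NER)
Mention = Tuple[date, str]


def mention_stats(mentions: Iterable[Mention]):
    """Return total counts, per-day counts and the persons mentioned once."""
    total: Counter = Counter()
    per_day: Dict[date, Counter] = defaultdict(Counter)
    for day, person in mentions:
        total[person] += 1
        per_day[day][person] += 1
    singletons = [person for person, n in total.items() if n == 1]
    return total, dict(per_day), singletons


def streamgraph_series(per_day: Dict[date, Counter], persons: List[str]):
    """One time series per person, aligned on the sorted list of days."""
    days = sorted(per_day)
    # A Counter returns 0 for persons not mentioned on a given day.
    series = {p: [per_day[d][p] for d in days] for p in persons}
    return days, series


# Hypothetical toy mentions, for illustration only.
mentions = [
    (date(2022, 11, 21), "Van Gaal"),
    (date(2022, 11, 21), "Noppert"),
    (date(2022, 11, 21), "Noppert"),
    (date(2022, 11, 25), "Van Gaal"),
    (date(2022, 12, 9), "Van Gaal"),
    (date(2022, 12, 9), "Messi"),
]

total, per_day, singletons = mention_stats(mentions)
days, series = streamgraph_series(per_day, ["Van Gaal", "Noppert"])
```

The resulting `days` and `series` can be handed to a plotting library; matplotlib's `stackplot` with `baseline="wiggle"` is one way to draw a streamgraph from them.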


Everything about [YOUR NAMED ENTITY HERE]
We took the case study using all articles on the World Cup 2022 one step further and asked ourselves the question "Can we use NER to generate summaries for a Named Entity?". We started by developing a module that collects all articles mentioning a given NE, which could serve as a collection of all information available on that NE for users particularly interested in it. More interestingly, the module also collects all sentences from this collection in which the NE is mentioned, resulting in a summary of the collection. As an example, we applied the module to _Andries Noppert_, the goalkeeper of the Dutch national team. Figure 3 already shows that Noppert was mentioned quite frequently during the event. Applying the module to Noppert resulted in a summary that quite nicely outlines the remarkable story of our goalkeeper, which is shown below as translated from Dutch.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-11-11
- Noppert joining as a penalty killer?
-------------------------------------------------- -------------------------------------------------- --------------------
2022-11-16
- sc Heerenveen goalkeeper Andries Noppert is the nineteenth premier league player in Qatar.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-11-20
- 'Don't worry about Qatar and Ecuador' and 'Failure on goal is a gamble' Analysts Leonne Stentler and Pierre van Hooijdonk agree.
- Van Gaal does not say anything about Noppert's base place, but hints at Gakpo 'at 10' According to various media, 28-year-old Andries Noppert, who plays for sc Heerenveen, would make his debut for the Orange squad against Senegal on Monday.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-11-21
- Is Noppert the base goalkeeper now?
- Noppert: 'This is what you dream of as a boy' Goalkeeper Andries Noppert turned out not to suffer from stage fright against Senegal.
- Will Noppert succeed first World Cup debutant Schoenaker?
- Goalkeeper Andries Noppert makes his debut in Orange and can look back on a successful first international match.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-11-22
- 'Disarming' Noppert takes the stage: 'In the Netherlands we are all whining' The 28-year-old goalkeeper of sc Heerenveen made his debut on Monday in the World Cup match against Senegal in the Dutch national team.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-11-23
- Noppert?
-------------------------------------------------- -------------------------------------------------- --------------------
2022-11-24
- The Foggia episode of Orange keeper Noppert: 'He smoked like a chimney' Andries Noppert is suddenly a well-known Dutchman after the World Cup match of the Netherlands against Senegal.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-11-25
- Jurriën Timber, Virgil van Dijk and Nathan Aké had their defenses well organized and Andries Noppert once again proved to be a reliable goalkeeper.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-12-03
- View the reactions of Virgil van Dijk and Andries Noppert here: In that team, one of the important players is just back in his familiar spot in the attack.
- Andries Noppert made a good save with his left leg.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-12-07
- Noppert lives soberly towards Argentina: 'Messi can also miss penalties, can't he?'
-------------------------------------------------- -------------------------------------------------- --------------------
2022-12-09
- So yes..." Noppert's fairy tale ended It could have been so beautiful for sc Heerenveen goalkeeper Andries Noppert, but the keeper on the other side, Emiliano Martinez, became the great hero.
- The Argentinian wingback Molina ran away from the back of his Dutch colleague Blind, Virgil van Dijk was just too late to correct and Molina passed Andries Noppert.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-12-16
- Six striking World Cup facts: Amrabat conquers, Modric dribbles, Noppert saves Remarkable statistics everywhere during the World Cup in Qatar.
-------------------------------------------------- -------------------------------------------------- --------------------
2022-12-18
- Andries Noppert (Netherlands) Vermeulen: "The same goes for Noppert, of course.
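The sentence-collecting module that produced the summary above could be sketched roughly as follows. Note the simplifications, which are assumptions of this sketch: sentences are split with a naive regular expression and the entity is matched on its surface form, whereas the real module relies on the NER output. The function names and the filler sentence in the toy input are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (ISO publication date, article text)
Article = Tuple[str, str]

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def summarise_entity(articles: Iterable[Article], entity: str) -> Dict[str, List[str]]:
    """Collect, per publication date, every sentence that mentions `entity`."""
    summary: Dict[str, List[str]] = defaultdict(list)
    pattern = re.compile(r"\b" + re.escape(entity) + r"\b")
    for published, text in articles:
        for sentence in _SENTENCE_END.split(text.strip()):
            if pattern.search(sentence):
                summary[published].append(sentence)
    # ISO dates sort chronologically when sorted as strings.
    return dict(sorted(summary.items()))


def render(summary: Dict[str, List[str]]) -> str:
    """Format the summary as dated blocks separated by rules."""
    lines: List[str] = []
    for published, sentences in summary.items():
        lines.append("-" * 50)
        lines.append(published)
        lines.extend(f"- {sentence}" for sentence in sentences)
    return "\n".join(lines)


# Toy input reusing two sentences from the summary above.
articles = [
    ("2022-12-03", "Andries Noppert made a good save with his left leg. "
                   "This filler sentence mentions nobody."),
    ("2022-11-21", "Is Noppert the base goalkeeper now?"),
]

summary = summarise_entity(articles, "Noppert")
print(render(summary))
```

Matching on the surname rather than the full name is a design choice: it also catches sentences that refer to the player as just "Noppert", at the cost of possible false matches for common surnames.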
An NE-aware recommendation system
So far we have seen that NEs are abundant in news articles and that applying NER can provide some interesting insights. There is one more experiment that we think is interesting to share in this blog, relating to the research question "Can we use NER to improve our content-based recommendation system?". Earlier on we developed a content-based recommendation system, which was recently integrated into our news app. Using both online and offline tests we compared various models and optimisations, and we now observe an increased click-through rate in our app. This is all great news, but we are always looking for ways to improve our recommendation system further. We received feedback from our editorial teams stating that the recommender gets confused by articles containing names of persons or places that are also regular words in the Dutch language. In the following section we report on an experiment using NER in an attempt to resolve this type of ambiguity.
The experiment
Our current recommendation system is based on cosine similarities using TF-IDF to vectorize texts. This basically means it relies heavily on word overlap to identify similar articles, but assigns higher relevance to words that are rare. One can imagine this method does not hold up when words have multiple meanings, which can be the case for NEs. As an example consider an article about the golfer Tiger Woods: a basic recommendation system might find related articles mentioning the animal tiger or articles about the woods. These would obviously not be useful recommendations. We hypothesised that this could be solved by introducing NE-awareness in our recommender by means of annotating NEs in texts by their type. In this case, the tokens would no longer overlap, as illustrated in Figure 5.
Figure 5: The basic system relates the two articles because the word 'tiger' is mentioned in both, while the NE-aware system resolves this ambiguity. Source: article by [NOS](https://nos.nl/), photo by Rick L on [Unsplash](https://unsplash.com/photos/ZnBBDPO2mbQ?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) (left); article by NOS, photo by Cristina Anne Costello on Unsplash (right).
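The Tiger Woods example can be reproduced with a small, self-contained TF-IDF sketch. Several things here are assumptions made for illustration: the `token_TYPE` annotation format, the particular IDF variant (`log(N / df) + 1`), and the two toy documents. Our actual recommender is not implemented this way line by line.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

# (token, NE type or None when the token is not part of a named entity)
Token = Tuple[str, Optional[str]]


def plain_tokens(doc: List[Token]) -> List[str]:
    """Lowercased tokens without any NE information."""
    return [tok.lower() for tok, _ in doc]


def ne_aware_tokens(doc: List[Token]) -> List[str]:
    """Lowercased tokens, with NE tokens suffixed by their type."""
    return [f"{tok.lower()}_{label}" if label else tok.lower() for tok, label in doc]


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors using raw counts and idf = log(N / df) + 1."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [
        {tok: count * (math.log(n_docs / df[tok]) + 1.0)
         for tok, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy documents with NER annotations.
golf = [("Tiger", "PERSON"), ("Woods", "PERSON"), ("wins", None),
        ("golf", None), ("tournament", None)]
zoo = [("zoo", None), ("welcomes", None), ("tiger", None), ("cub", None)]

plain_sim = cosine(*tfidf_vectors([plain_tokens(golf), plain_tokens(zoo)]))
aware_sim = cosine(*tfidf_vectors([ne_aware_tokens(golf), ne_aware_tokens(zoo)]))
print(f"plain: {plain_sim:.3f}, NE-aware: {aware_sim:.3f}")
```

In the plain variant the shared token `tiger` gives the two articles a non-zero similarity; in the NE-aware variant `tiger_PERSON` and `tiger` are different tokens, so the overlap disappears.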
We implemented NE-awareness using the NE types person, location and organisation, as well as a combination of all three. We evaluated the various models using a test set that was manually annotated by our editorial teams and contains information on which articles are related. This test set contains 14,541 unique articles, and on average each article is linked to about 2 other articles. As an evaluation metric, we calculated the average rank of the curated linked articles within the sorted recommendations, where a lower average rank means the linked articles appear closer to the top.
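The evaluation metric can be written down in a few lines. This is a sketch with hypothetical names and toy data; in particular, how to treat a linked article that does not appear in the recommendations is an assumption here (it is placed one position past the end of the list).

```python
from statistics import mean
from typing import Dict, List, Set


def average_rank(recommendations: Dict[str, List[str]],
                 linked: Dict[str, Set[str]]) -> float:
    """Mean 1-based rank of curated linked articles in the sorted recommendations."""
    ranks: List[int] = []
    for article_id, related in linked.items():
        ranking = recommendations[article_id]
        position = {rec: i for i, rec in enumerate(ranking, start=1)}
        for rel in related:
            # Assumption: a linked article missing from the list
            # is ranked just below the last recommendation.
            ranks.append(position.get(rel, len(ranking) + 1))
    return mean(ranks)


# Hypothetical toy data: recommendations sorted by decreasing similarity.
recommendations = {
    "a": ["b", "c", "d"],
    "b": ["d", "a", "c"],
}
# Curated links from the editorial test set (toy version).
linked = {
    "a": {"b", "d"},
    "b": {"a"},
}

score = average_rank(recommendations, linked)
print(score)
```

For the toy data the linked articles sit at ranks 1, 3 and 2, so the average rank is 2. Comparing this number across models only makes sense when all models rank the same candidate set.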
Figure 6 shows the results for our base model and the various NE-aware models. Our base model actually outperforms all of the NE-aware models. In theory, introducing NE-awareness should improve the recommender, but in practice we see that it introduces more ambiguity than it resolves. We looked into the output of the various models in detail and saw that we are limited by the performance of the NER model. The spaCy NER model yields an F-score of 0.77 on its own test set, and this score may be lower when the model is applied to another data set, so one can expect the model to be inaccurate occasionally. From a manual inspection of some recommendations output by the NE-aware models, we saw that in combination with TF-IDF the effect of incorrectly detected NEs is quite strong. For articles with an incorrectly detected NE, the output recommendations often contain the same incorrectly detected NE. For example, we saw an article in which the word hindsight was classified as an NE of type Person, resulting in recommendations that contained the same incorrectly classified NE hindsight. Although NER is off in this case, the recommendations are what one would expect from the model, because TF-IDF assigns a higher relevance to tokens like _hindsightPerson_ as they are very rare in the corpus. Our conclusion is that the pre-trained Dutch NER models are at this point not accurate enough to be incorporated into our recommendation system.

We might benefit from fine-tuning the pre-trained models ourselves in the future. For now, we explored another approach to resolving NE ambiguity: using metadata such as categories and keywords as a noiseless, though less strongly related, proxy for NEs. This improved our recommender quite a bit.
Conclusion
In this blog we explored what can be done with Named Entity Recognition when applied to a Dutch news data set. We saw that it works well for deriving general insights about the data set, such as the construction of NE frequency plots and streamgraphs. However, when we applied it to our recommender, the models were not accurate enough. While introducing NE-awareness resolved some NE ambiguity, it simultaneously introduced new ambiguity in the form of errors in NE detection. In the future we might experiment with fine-tuning a pre-trained model or training our own model from scratch. If you have any suggestions yourself, let us know in the comments!
All images unless otherwise noted are by the author.
References
[1] spaCy NER model: https://spacy.io/models/nl#nl_core_news_lg
[2] Flair NER model: https://huggingface.co/flair/ner-dutch-large
[3] NLTK NER model: https://www.nltk.org/book/ch07.html
About the NOS
NOS is an independent public media organisation in the Netherlands reporting on news and sports through platforms such as television, radio, websites and mobile apps. We have dedicated teams of professionals to create digital services for several brands. The research described in this blog was performed as a member of the NOS data team, who as a team are responsible for exploring the usage of novel Data Science and AI techniques for the context of news.