Named Entity Recognition NER using spaCy | NLP | Part 4

Text Processing using spaCy | NLP Library

Ashutosh Tripathi
Towards Data Science

Named Entity Recognition (NER) is one of the most important steps, and often the first one, in information extraction: the task of pulling important and useful information out of unstructured raw text documents. NER locates the named entities in unstructured text and classifies them into standard categories such as person names, locations, organizations, time expressions, quantities, monetary values, percentages, and codes. spaCy comes with an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens.

Spacy Installation and Basic Operations | NLP Text Processing Library | Part 1

spaCy lets you add arbitrary classes to the entity recognition system and update the model with new examples, in addition to the entities already defined in the model.

spaCy's ‘ner’ pipeline component identifies token spans that fit a predetermined set of named entity types. These spans are available through the ‘ents’ property of a Doc object.

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + str(ent.start_char) + ' - ' + str(ent.end_char) + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

doc1 = nlp("Apple is looking at buying U.K. startup for $1 billion")
show_ents(doc1)

doc2 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')
show_ents(doc2)

Here we see several tokens combine to form a single entity, as in "next May" and "the Washington Monument".

doc3 = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')
for ent in doc3.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

Entity Annotations

Doc.ents are token spans with their own set of annotations.

Accessing Entity Annotations

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value using ent.label or as a string using ent.label_.

The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
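As a minimal sketch of these access patterns, the snippet below builds a Doc with a blank English pipeline and attaches one entity by hand, so it runs without downloading a trained model. The blank pipeline and the hand-set GPE label are assumptions made for illustration; with en_core_web_sm loaded, doc.ents would be filled in by the model instead.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline (tokenizer only) stands in for a trained model.
nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Hand-set entity covering tokens 0 and 1 (the end index is exclusive).
doc.ents = [Span(doc, 0, 2, label="GPE")]

ent = doc.ents[0]
print(ent.text)                      # text of the whole entity
print([t.text for t in ent])         # iterate over the tokens in the span
print(ent[0].text, ent[-1].text)     # index into the span
print(ent.label_)                    # label as a string
print(ent.label)                     # the same label as an integer hash
print(nlp.vocab.strings[ent.label])  # hash resolved back to the string
```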

You can also access token-level entity annotations using the token.ent_iob and token.ent_type attributes (token.ent_iob_ and token.ent_type_ give the string forms). token.ent_iob indicates whether the token begins an entity, is inside one, or is outside any entity. If no entity type is set on a token, token.ent_type_ returns an empty string.

doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
for e in doc.ents:
    print(e.text, e.start_char, e.end_char, e.label_)

# OR
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
# doc[0], doc[1], ... hold the individual tokens.
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)
print(ent_francisco)

IOB SCHEME
I - Token is inside an entity.
O - Token is outside an entity.
B - Token is the beginning of an entity.

Note: In the above example, only San Francisco is recognized as a named entity, so the rest of the tokens are tagged as outside an entity. Within San Francisco, San is the beginning of the entity and Francisco is inside it.
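To see the tag for every token at once, here is a small sketch that assumes spaCy v3 and a blank pipeline with the San Francisco entity set by hand (both are assumptions for illustration, not the trained model's prediction). In spaCy v3, assigning to doc.ents marks the remaining tokens as outside.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: tokenizer only, no trained NER
doc = nlp("San Francisco considers banning sidewalk delivery robots")
doc.ents = [Span(doc, 0, 2, label="GPE")]  # hand-set for illustration

# One (text, IOB tag, entity type) triple per token.
iob_tags = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
for row in iob_tags:
    print(row)
```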

Tags are accessible through the .label_ property of an entity.

User-Defined Named Entity and Adding it to a Span

Normally we would have spaCy build up its knowledge of named entities by training it on many samples of text.
Sometimes, however, we want to assign a named entity to a specific token that the trained spaCy model does not recognize. We can do this by creating a Span for that token and adding it to doc.ents.
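A minimal sketch of the idea follows. The sentence, the choice of the ORG label, and the blank pipeline are all assumptions made so the snippet runs without a model download; with a trained pipeline you would load it with spacy.load and start from whatever entities it already found.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline, so no entities are predicted up front.
nlp = spacy.blank("en")
doc = nlp("Tesla to build a U.K. factory for $6 million")

# Build a Span over token 0 ("Tesla"); start is inclusive, end is exclusive.
new_ent = Span(doc, 0, 1, label="ORG")

# doc.ents is a tuple, so convert it to a list, append, and assign it back.
doc.ents = list(doc.ents) + [new_ent]

print([(ent.text, ent.label_) for ent in doc.ents])
```

Note that entities in doc.ents may not overlap: if the new span overlaps an entity the model already found, the assignment raises an error, so the existing entity would have to be dropped from the list first.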

Example 1

Example 2

Adding Named Entities to All Matching Spans

What if we want to tag all occurrences of a token? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:

doc = nlp(u'Our company plans to introduce a new vacuum cleaner. If successful, the vacuum cleaner will be our first product.')
show_ents(doc)
# output: first - 99 - 104 - ORDINAL - "first", "second", etc.

# Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]

# Apply the patterns to our matcher object:
# (spaCy v2 call signature; in spaCy v3 the patterns are passed as a list:
# matcher.add('newproduct', phrase_patterns))
matcher.add('newproduct', None, *phrase_patterns)

# Apply the matcher to our Doc object:
matches = matcher(doc)

# See what matches occur:
matches
# output: [(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]

# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span
PROD = doc.vocab.strings[u'PRODUCT']
# match[1] is the start token index of the match and match[2] is the stop token index (exclusive) in the doc.
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]
doc.ents = list(doc.ents) + new_ents
show_ents(doc)

# output:
# vacuum cleaner - 37 - 51 - PRODUCT - Objects, vehicles, foods, etc. (not services)
# vacuum cleaner - 72 - 86 - PRODUCT - Objects, vehicles, foods, etc. (not services)
# first - 99 - 104 - ORDINAL - "first", "second", etc.

Counting Entities

While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:
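A minimal sketch of that idea, assuming a blank pipeline with entities set by hand (the sentence and labels are illustrative, not model output). The same comprehension works unchanged on a Doc produced by a trained pipeline.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: entities are set by hand, not predicted
doc = nlp("Alice paid five dollars and Bob paid nine dollars in Paris")
doc.ents = [
    Span(doc, 0, 1, label="PERSON"),  # Alice
    Span(doc, 2, 4, label="MONEY"),   # five dollars
    Span(doc, 5, 6, label="PERSON"),  # Bob
    Span(doc, 7, 9, label="MONEY"),   # nine dollars
    Span(doc, 10, 11, label="GPE"),   # Paris
]

# Count entities of one type with a conditional list comprehension.
money_count = len([ent for ent in doc.ents if ent.label_ == "MONEY"])
print(money_count)

# Count every entity type at once.
label_counts = Counter(ent.label_ for ent in doc.ents)
print(label_counts)
```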

Visualizing NER

spaCy includes a built-in visualizer called “displaCy”, which helps us explore the behaviour of the entity recognition model interactively.

If you are training a model, it’s very useful to run the visualization yourself.

You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the webserver, or displacy.render to generate the raw mark-up.

#Import the displaCy library
from spacy import displacy
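As a rough sketch of the render path, the snippet below asks displaCy for the raw entity markup as a string. The blank pipeline and the hand-set entity are assumptions so that it runs without a trained model; in a notebook you would typically pass jupyter=True to display the result inline.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: hand-set entity instead of a trained model
doc = nlp("San Francisco considers banning sidewalk delivery robots")
doc.ents = [Span(doc, 0, 2, label="GPE")]

# jupyter=False makes render() return the markup as a string
# instead of displaying it inline in a notebook.
html = displacy.render(doc, style="ent", jupyter=False)
print(html)

# displacy.serve starts a local web server and blocks until interrupted,
# so the call is left commented out here.
# displacy.serve(doc, style="ent")
```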

Visualizing Sentences Line by Line
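One way to do this is to render each sentence in doc.sents separately. The sketch below assumes spaCy v3, a blank pipeline with the rule-based sentencizer, and entities set by hand; with a trained pipeline, sentence boundaries and entities would come from the model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # assumption: spaCy v3 API, rule-based sentence splitting

doc = nlp("Alice moved to Paris. Bob stayed in Berlin.")
doc.ents = [
    Span(doc, 0, 1, label="PERSON"),  # Alice
    Span(doc, 3, 4, label="GPE"),     # Paris
    Span(doc, 5, 6, label="PERSON"),  # Bob
    Span(doc, 8, 9, label="GPE"),     # Berlin
]

# Render one block of markup per sentence.
fragments = [
    displacy.render(sent.as_doc(), style="ent", jupyter=False)
    for sent in doc.sents
]
for fragment in fragments:
    print(fragment)
```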

Viewing Specific Entities

You can pass a list of entity types to restrict the visualization:
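A short sketch, again assuming a blank pipeline with hand-set entities: the "ents" key in the options dictionary lists the labels to highlight, and entities with other labels are rendered as plain text.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: entities set by hand for illustration
doc = nlp("Alice moved to Paris")
doc.ents = [Span(doc, 0, 1, label="PERSON"), Span(doc, 3, 4, label="GPE")]

# Only GPE entities are highlighted; the PERSON entity is left unmarked.
options = {"ents": ["GPE"]}
html = displacy.render(doc, style="ent", options=options, jupyter=False)
print(html)
```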

Styling: customize colour and effects

You can also pass background colour and gradient options:
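For instance, the "colors" key maps entity labels to CSS colour values, which may be plain colours or gradients. The sentence, labels, and colour values below are illustrative assumptions, and the blank pipeline is used only so the sketch runs without a trained model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: entities set by hand for illustration
doc = nlp("Apple unveiled the iPhone")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 3, 4, label="PRODUCT")]

# Map each label to a CSS background value.
colors = {
    "ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",
    "PRODUCT": "radial-gradient(yellow, green)",
}
options = {"ents": ["ORG", "PRODUCT"], "colors": colors}

html = displacy.render(doc, style="ent", options=options, jupyter=False)
print(html)
```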

That covers Named Entity Recognition (NER) and its visualization using spaCy. I hope you enjoyed the post.

In the next article, I will describe sentence segmentation. Stay tuned!

If you have any feedback or thoughts on how to improve the content, please write them in the comment section below. Your comments are very valuable.

Thank You!

Originally published at http://ashutoshtripathi.com on April 27, 2020.
