A Little spaCy Food for Thought: An Easy-to-Use NLP Framework

Haydar Özler
Towards Data Science
5 min read · May 16, 2019


In the article below you will read how to start working with spaCy quickly and easily. It is aimed especially at beginners in NLP, with step-by-step instructions and clear examples.

spaCy is an NLP framework released in February 2015 by Explosion AI. It is known as one of the fastest NLP libraries in the world. Its other advantages are ease of use and support for neural network models.

Step-1: Install spaCy

Open your terminal (command prompt) and write:

pip install spacy
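
If you want to verify the installation before moving on, a quick check from the same terminal:

python -c "import spacy; print(spacy.__version__)"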

Step-2: Download Language Model

Write the following command (still in your terminal):

python -m spacy download en_core_web_lg

The model (en_core_web_lg) is the largest English model in spaCy, with a size of 788 MB. There are smaller English models as well, and models for several other languages (German, French, Spanish, Portuguese, Italian, Dutch, Greek).
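
If the large model is too heavy for a first try, the small English model is enough for everything in this article except the word-vector examples in Steps 8 and 9 (the sm models ship without real word vectors):

python -m spacy download en_core_web_sm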

Step-3: Import Library and Load the Model

You are ready to have some NLP fun after you write the following lines in your Python editor:

import spacy
nlp = spacy.load('en_core_web_lg')

Step-4: Create Sample Text

sample_text = "Mark Zuckerberg took two days to testify before members of Congress last week, and he apologised for privacy breaches on Facebook. He said that the social media website did not take a broad enough view of its responsibility, which was a big mistake. He continued to take responsibility for Facebook, saying that he started it, runs it, and he is responsible for what happens at the company. Illinois Senator Dick Durbin asked Zuckerberg whether he would be comfortable sharing the name of the hotel where he stayed the previous night, or the names of the people who he messaged that week. The CEO was startled by the question, and he took about 7 seconds to respond with no."

doc = nlp(sample_text)

Step-5: Splitting Sentences of a Paragraph

Let's split this text into sentences and print the character length of each sentence:

sentences = list(doc.sents)
for sentence in sentences:
    print(sentence.text)
    print("Number of characters:", len(sentence.text))
    print("-----------------------------------")

Output:

Mark Zuckerberg took two days to testify before members of Congress last week, and he apologised for privacy breaches on Facebook.
Number of characters: 130
-----------------------------------
He said that the social media website did not take a broad enough view of its responsibility, which was a big mistake.
Number of characters: 118
-----------------------------------
He continued to take responsibility for Facebook, saying that he started it, runs it, and he is responsible for what happens at the company.
Number of characters: 140
-----------------------------------
Illinois Senator Dick Durbin asked Zuckerberg whether he would be comfortable sharing the name of the hotel where he stayed the previous night, or the names of the people who he messaged that week.
Number of characters: 197
-----------------------------------
The CEO was startled by the question, and he took about 7 seconds to respond with no.
Number of characters: 85
-----------------------------------

Step-6: Entity Recognition

Entity recognition performance is an important evaluation criterion for an NLP model. spaCy achieves it with one line of code, and quite successfully:

from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

Output: displacy renders the text inline with the named entities highlighted, e.g. Mark Zuckerberg as PERSON, Congress and Facebook as ORG, and last week as DATE.
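
displacy is designed for notebooks; if you are running in a plain terminal instead, the same entities can be read off the doc object directly:

for ent in doc.ents:
    print(ent.text, ent.label_)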

Step-7: Tokenization and Part-of-speech Tagging

Let’s tokenize the text and see some attributes of each token:

for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Output:

Mark	0	mark	False	False	Xxxx	PROPN	NNP
Zuckerberg 5 zuckerberg False False Xxxxx PROPN NNP
took 16 take False False xxxx VERB VBD
two 21 two False False xxx NUM CD
days 25 day False False xxxx NOUN NNS
to 30 to False False xx PART TO
testify 33 testify False False xxxx VERB VB
before 41 before False False xxxx ADP IN
members 48 member False False xxxx NOUN NNS
of 56 of False False xx ADP IN

Again, it is very easy to apply and gives immediately satisfying results. A brief explanation of the attributes printed out above:

text: the token itself
idx: starting byte of the token
lemma_: root form of the word
is_punct: whether the token is a punctuation symbol
is_space: whether the token is whitespace
shape_: shape of the token, showing which letters are capitalized
pos_: the simple part-of-speech tag
tag_: the detailed part-of-speech tag

What is a part-of-speech tag?

Part-of-speech tagging is the process of splitting the whole text into tokens and then assigning each token a tag such as noun, verb, or adjective.
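
If any of the tag abbreviations in the table above are unclear, spacy.explain() returns a short human-readable description:

print(spacy.explain('NNP'))   # noun, proper singular
print(spacy.explain('VBD'))   # verb, past tense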

Step-8: There are only numbers

If we are dealing with languages and texts, where do numbers come in?

Since machines need to convert everything into numbers to understand the world, each word is represented by an array of numbers (a word vector) in NLP. Here is the word vector of "man" in the spaCy dictionary:

[-1.7310e-01,  2.0663e-01,  1.6543e-02, ....., -7.3803e-02]

The length of spaCy's word vectors is 300. It can differ in other frameworks.
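
You can check the dimensionality yourself; each vector comes back as a plain NumPy array (this assumes the en_core_web_lg model loaded in Step-3):

man_vector = nlp.vocab['man'].vector
print(man_vector.shape)   # (300,)
print(man_vector.dtype)   # float32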

After word vectors are established, we can observe that contextually similar words are also mathematically similar. Here are some examples:

from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

print("apple vs banana: ", cosine_similarity(nlp.vocab['apple'].vector, nlp.vocab['banana'].vector))
print("car vs banana: ", cosine_similarity(nlp.vocab['car'].vector, nlp.vocab['banana'].vector))
print("car vs bus: ", cosine_similarity(nlp.vocab['car'].vector, nlp.vocab['bus'].vector))
print("tomatos vs banana: ", cosine_similarity(nlp.vocab['tomatos'].vector, nlp.vocab['banana'].vector))
print("tomatos vs cucumber: ", cosine_similarity(nlp.vocab['tomatos'].vector, nlp.vocab['cucumber'].vector))

Output:

apple vs banana:  0.5831844210624695
car vs banana: 0.16172660887241364
car vs bus: 0.48169606924057007
tomatos vs banana: 0.38079631328582764
tomatos vs cucumber: 0.5478045344352722

Impressive, right? When two fruits or vegetables, or two vehicles, are compared, the similarities are higher. When two unrelated objects such as a car and a banana are compared, the similarity is very low. The similarity of tomatos and bananas is higher than that of a car and a banana, but lower than both tomatos vs cucumber and apple vs banana, which reflects reality.
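
As a side note, spaCy can compute these similarities itself, so scipy is not strictly required; a minimal sketch (for single-word docs this matches the cosine values above):

print(nlp('car').similarity(nlp('bus')))        # close to the 0.4817 above
print(nlp('apple').similarity(nlp('banana')))   # close to the 0.5832 above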

Step-9: king = queen + (man - woman)?

If everything is represented by numbers and if we can calculate similarities in a mathematical way, can we make some other calculations? For example, if we subtract “woman” from “man” and add the difference to “queen”, can we find “king”? Let’s try:

from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector

calculated_king = man - woman + queen

print("similarity between our calculated king vector and real king vector:", cosine_similarity(calculated_king, king))

Output:

similarity between our calculated king vector and real king vector: 0.771614134311676

Not bad, I think. You can try it with different words and will observe similarly promising results.
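
A natural next step is to search the whole vocabulary for the words closest to calculated_king. The rough sketch below reuses the cosine_similarity lambda from above; note that iterating over nlp.vocab can be slow, and its exact contents depend on the model:

def most_similar(target_vector, topn=5):
    # keep only lowercase, alphabetic vocabulary entries that have a vector
    words = [w for w in nlp.vocab
             if w.has_vector and w.is_lower and w.is_alpha]
    # rank them by cosine similarity to the target vector
    words.sort(key=lambda w: cosine_similarity(w.vector, target_vector),
               reverse=True)
    return [w.text for w in words[:topn]]

print(most_similar(calculated_king))

If everything works, "king" should appear among the top results.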

In Conclusion

The purpose of this article was to give a brief and easy introduction to the spaCy framework and to show some simple NLP application examples. Hopefully it was helpful. You can find detailed information and lots of examples on their very well designed and informative website, https://spacy.io/.

If you have any further questions, please don’t hesitate to write: haydarozler@gmail.com.
