NLP: Easy Explanation of Common Terms with Python

Namrata Kapoor
Towards Data Science
6 min read · Oct 26, 2020


Natural language processing (NLP) is a field at the intersection of data science, linguistics, and artificial intelligence that is concerned with the interactions between computer systems and human language, so that systems can interpret and analyze natural language. It is an expanding field in data science where various techniques are applied to analyze large amounts of natural language data.

NLP [image by author]

The most popular library used for this work is nltk, which can be imported with the following line of code.

import nltk

A few common terms used when we talk about NLP models are given below, along with their meanings and implementations.

1) Tokenization: Breaking text down into smaller parts is called tokenization. It can mean chunking paragraphs into sentences and sentences into words.
# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)

# Tokenizing words
words = nltk.word_tokenize(paragraph)
Paragraph [image by author]
Paragraph split into sentences [image by author]
Paragraph split into words [image by author]
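
As a minimal, runnable sketch (the sample paragraph below is my own illustrative text, not the one in the screenshots), this is what the two tokenizers return:

import nltk
nltk.download('punkt')  # tokenizer models, needed once

# a small illustrative paragraph (assumed example)
paragraph = "NLP is an exciting field. It helps computers understand human language."

sentences = nltk.sent_tokenize(paragraph)
words = nltk.word_tokenize(paragraph)

print(sentences)
# ['NLP is an exciting field.', 'It helps computers understand human language.']
print(words)
# ['NLP', 'is', 'an', 'exciting', 'field', '.', 'It', 'helps', 'computers', 'understand', 'human', 'language', '.']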

2) Stemming: Reducing several similar words to a common stem or base word is called stemming.

The most popular stemmer for the English language is the Porter stemmer, which is available in nltk.

An example of the stemmer function is shown below:

Stemming [Image by author]

Advantage: It is a fast algorithm and can be used in models where speed is required.

Disadvantage: It does not take the meaning of the word into consideration; it just reduces the word to a stem, which may not itself be a meaningful word.

In Python it is done as follows:

# imports
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

ps = PorterStemmer()
# `review` is a list of word tokens; keep non-stop words and reduce each to its stem
review = [ps.stem(word) for word in review if word not in set(stopwords.words('english'))]
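
For intuition, here is a minimal sketch (the example words are my own) showing that the Porter stemmer can return stems that are not dictionary words, which is exactly the disadvantage mentioned above:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
for word in ["history", "historical", "finally", "going"]:
    print(word, "->", ps.stem(word))
# history -> histori
# historical -> histor
# finally -> final
# going -> go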

3) Lemmatization: Reducing similar words to a base word that has meaning is called lemmatization. It takes into consideration a vocabulary and the morphological analysis of words, aiming to remove inflectional endings only and return the base form found in a dictionary, called the lemma.

An example is shown below:

Lemmatization [image by author]

Advantage: It returns meaningful words, so it is mostly used in chatbots, where giving meaningful responses is the main idea.

Disadvantage: It is slower than stemming, so it is not preferred where time is the main consideration.

# imports
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

wordnet = WordNetLemmatizer()
# `review` is a list of word tokens; keep non-stop words and reduce each to its lemma
review = [wordnet.lemmatize(word) for word in review if word not in set(stopwords.words('english'))]
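
A minimal sketch (the example words are my own) contrasting the lemmatizer with the stemmer above; note that the results are valid dictionary words:

import nltk
nltk.download('wordnet')  # WordNet data, needed once
from nltk.stem import WordNetLemmatizer

wordnet = WordNetLemmatizer()
print(wordnet.lemmatize("histories"))        # history
print(wordnet.lemmatize("going", pos="v"))   # go
print(wordnet.lemmatize("better", pos="a"))  # good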

4) Stop Words: Words that are not very important for language processing can be removed before applying any model or before analyzing the text for sentiment. Words like 'is', 'an', 'you', and 'the' are called stop words, and the list can be imported with 'from nltk.corpus import stopwords'.

Stop Words [image by author]

In Python:

#import
from nltk.corpus import stopwords
# keep only the words that are not in the English stop word list
review = [wordnet.lemmatize(word) for word in review if word not in set(stopwords.words('english'))]
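
A quick sketch of what the list contains and how it filters a sentence (the sample sentence is my own):

import nltk
nltk.download('stopwords')  # stop word lists, needed once
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
print(sorted(stop)[:5])  # ['a', 'about', 'above', 'after', 'again']

words = "she is a very good and decent woman".split()
print([w for w in words if w not in stop])
# ['good', 'decent', 'woman']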

5) Bag of Words: This is a way of representing and processing words for machine learning algorithms. It captures the occurrence and frequency of each word. In this model a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping the count of occurrences.

Advantage:

It is fast, frequency is taken into consideration, and it is easy to implement.

Disadvantage:

It does not turn data into information, i.e. the meaning of the words is lost along the way. It also assumes all words are independent of each other.

It is suitable only for small datasets.

Example:

Sentence 1: She is a very good and decent woman, she is also a good artist.

Sentence 2: He is a bad man but a good driver.

Sentence 3: Man and woman are equal in a decent society.

Bag of Words [image by author]
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)
# fit_transform expects a list of documents, e.g. the tokenized sentences of the paragraph
X = cv.fit_transform(sentences).toarray()
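
Applied to the three example sentences above, a minimal sketch looks like this (note that the default tokenizer of CountVectorizer lowercases the text and drops single-character tokens such as 'a'):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "She is a very good and decent woman, she is also a good artist.",
    "He is a bad man but a good driver.",
    "Man and woman are equal in a decent society.",
]

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

print(cv.get_feature_names_out())  # the learned vocabulary (get_feature_names() in older scikit-learn)
print(X.shape)                     # (3, number of unique words)
print(X[0])                        # word counts for sentence 1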

6) TF-IDF: Term Frequency-Inverse Document Frequency is a statistical measure that evaluates how relevant a word is to a text within a collection of texts. It is obtained as the product of two metrics: term frequency and inverse document frequency.

Term Frequency (TF) = (number of times the word appears in a sentence) / (total number of words in the sentence)

Inverse Document Frequency (IDF) = log(number of sentences / number of sentences containing the word)

Example:

Sentence 1: She is a very good and decent woman, she is also a good artist.

Sentence 2: He is a bad man but a good driver.

Sentence 3: Man and woman are equal in a decent society.

After removing the stop words, the sentences become:

Sentence 1: very good decent woman good artist.

Sentence 2: bad man good driver.

Sentence 3: Man woman equal decent society.

Its calculation is derived as below:

TF [image by author]
IDF [image by author]
TFIDF [image by author]
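
To make the arithmetic concrete, here is a minimal sketch (my own illustrative code) that applies exactly the two formulas above to one word, 'good', over the three stop-word-free sentences:

import math

docs = [
    "very good decent woman good artist".split(),
    "bad man good driver".split(),
    "man woman equal decent society".split(),
]
word = "good"

# Term Frequency: repetitions of the word in a sentence / number of words in the sentence
tf = [doc.count(word) / len(doc) for doc in docs]

# Inverse Document Frequency: log(number of sentences / number of sentences containing the word)
containing = sum(1 for doc in docs if word in doc)
idf = math.log(len(docs) / containing)  # natural log here; base 10 is also common

print(tf)                      # [0.333..., 0.25, 0.0]
print(idf)                     # log(3/2) ≈ 0.405
print([t * idf for t in tf])   # TF-IDF of 'good' in each sentence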

Advantages: It is easy to compute and implement. It gives a basic metric for extracting the most descriptive terms in a text, it makes it easy to compute the similarity between two texts, and search engines can use it.

Disadvantages: TF-IDF does not capture the semantics or the position of words in a text.

In Python this can be done as:

# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer

cv = TfidfVectorizer()
# fit_transform expects a list of documents, e.g. the tokenized sentences of the paragraph
X = cv.fit_transform(sentences).toarray()

7) Word2Vec: Word2Vec is a technique for natural language processing (NLP). The word2vec algorithm uses a neural network model to learn word semantics and word associations from a large corpus of text. Once trained, such a model can detect similar words or suggest additional words to complete a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector.

The vectors are chosen carefully so that the cosine similarity between two vectors indicates the level of semantic similarity between the words represented by those vectors.
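
As a small illustration of that cosine measure (toy vectors of my own, not actual Word2Vec outputs):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); values close to 1 mean very similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.2])
apple = np.array([0.1, 0.9, 0.8])

print(cosine_similarity(king, queen))  # high, the words are related
print(cosine_similarity(king, apple))  # lower, the words are less related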

Word2Vec [image by author]

Advantages:

It transforms an unlabeled raw corpus into labeled data by mapping each target word to the words it has a contextual relation with, so the representation of words is learned as part of a classification model.

The mapping between a target word and its context words embeds sub-linear relationships into the vector space of words, so that relationships like "king is to man as queen is to woman" can be inferred from the word vectors.

Easy to understand and implement.

Disadvantages:

The sequence of words is lost and hence sub-linear relationships are not very well defined.

The data has to be fed to the model online and may need pre-processing, which requires memory space.

The model can be very difficult to train if the number of categories is too large, i.e. the corpus is very big and the vocabulary is very large.

In Python it can be implemented as:

import nltk
from gensim.models import Word2Vec

# Tokenize the paragraph into sentences, then each sentence into lists of words
sentences = nltk.sent_tokenize(paragraph)
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)

# Vocabulary learned by the model (model.wv.vocab in gensim 3.x, key_to_index in gensim 4.x)
words = model.wv.key_to_index

# Most similar words
similar = model.wv.most_similar('woman')
# Output:
# [('driver', 0.15176300704479218),
#  ('artist', 0.06272515654563904),
#  ('good', 0.0425836481153965),
#  ('man', -0.0059792473912239075)]

Conclusion:

It is important to note that each technique was developed to address a shortcoming of the one before it: lemmatization improves on stemming because it converts similar words to a meaningful base word, TF-IDF improves on bag of words because it captures more information about how important a word is, and Word2Vec improves on TF-IDF because it can capture contextual relations between words.

Still, in all these techniques the semantics or order of words is lost, which is why Recurrent Neural Networks (RNNs) came into existence: they keep the information carried by word order intact and perform impressively on tasks such as sentiment analysis.

I hope that after reading this blog many of these terms are clearer and NLP seems a little easier.

Thanks for reading!

Originally published at https://www.numpyninja.com on October 26, 2020.
