Natural Language Processing: A Beginner’s Guide Part-I

Somesh Routray
Towards Data Science
7 min read · Feb 22, 2020


Learn NLP with the nltk library from scratch


Expressions carry a huge amount of data. Every time we speak or write, our words carry interpretations related to specific topics, communities, doctrines, and so on, which can influence people on a large scale. For a human, analyzing all this data is very difficult, but thanks to Machine Learning and AI, it has become much easier!

In the Netizen era, information circulates across social media platforms and e-newspapers in many different languages. Collecting that unstructured data and analyzing its different interpretations is now possible with Natural Language Processing.

Natural Language Processing is the field of Artificial Intelligence that deals with the interaction between machines and human languages. Put differently, NLP helps machines understand and derive the meaning of human (natural) languages.

USE CASES

Smart assistants like Google Assistant, Siri, and Amazon Alexa work based on the concepts of NLP. These assistants convert your speech into text, analyze it, and then act on it.

Sentiment Analysis: On social media platforms, you can use NLP to analyze people’s sentiments about tweets or posts. Organizations analyze customers’ sentiments about a product by processing feedback with NLP and use the insights to gain in business.

Topic Modelling: You can use the LDA (Latent Dirichlet Allocation) technique to dive into the world of Topic Modelling. I will discuss it in more detail in part-2 of this blog.

Spam Detection: Companies use NLP to classify the spam mails flowing through their servers.

Fake News Detection: Some institutes use NLP to detect fake news circulating in social media and e-media.

There are many other use cases for NLP. In this blog, you will learn beginner-level text analysis of a newspaper article using NLP. For this, you can either go to the link and copy the article directly, or scrape it using Python. Web scraping is really interesting to do; I will discuss it in another blog.

Let’s do something great with NLP!


Basic Installation:

For this blog, the open-source nltk library and the Spyder IDE will be handy. The library provides a lot of NLP functionality. First, you need to import nltk and then download some required data packages (tokenizer models, stopword lists, WordNet, and so on).

#import library nltk
import nltk
nltk.download()

As soon as you run these lines in the Spyder IDE, the NLTK downloader pop-up will appear. You need to download all the packages.

TOKENIZATION: This is the text-processing step that splits a large string into pieces called tokens. In other words, this technique splits a paragraph into a list of sentences and a sentence into a list of words.

paragraph = '''
Ahead of U.S. President Donald Trump’s visit to India, some of the key deliverables from the trip, as well as the outcomes that may not be delivered after his meeting with Prime Minister Narendra Modi on Tuesday, are coming into view. The larger question remains as to whether the bonhomie between the two, who will be meeting for the fifth time in eight months, will also spur the bilateral relationship towards broader outcomes, with expectations centred at bilateral strategic ties, trade and energy relations as well as cooperation on India’s regional environment. On the strategic front, India and the U.S. are expected to take forward military cooperation and defence purchases totalling about $3 billion. Mr. Trump has cast a cloud over the possibility of a trade deal being announced, but is expected to bring U.S. Trade Representative Robert Lighthizer to give a last push towards the trade package being discussed for nearly two years. Both sides have lowered expectations of any major deal coming through, given that differences remain over a range of tariffs from both sides; market access for U.S. products; and India’s demand that the U.S. restore its GSP (Generalised System of Preferences) status. However, it would be a setback if some sort of announcement on trade is not made. A failure to do so would denote the second missed opportunity since Commerce Minister Piyush Goyal’s U.S. visit last September. Finally, much of the attention will be taken by India’s regional fault-lines: the Indo-Pacific strategy to the east and Afghanistan’s future to the west. India and the U.S. are expected to upgrade their 2015 joint vision statement on the Indo-Pacific to increase their cooperation on freedom of navigation, particularly with a view to containing China. 
Meanwhile, the U.S.-Taliban deal is expected to be finalised next week, and the two leaders will discuss India’s role in Afghanistan, given Pakistan’s influence over any future dispensation that includes the Taliban.Any high-level visit, particularly that of a U.S. President to India, is as much about the optics as it is about the outcomes. It is clear that both sides see the joint public rally at Ahmedabad’s Motera Stadium as the centrepiece of the visit, where the leaders hope to attract about 1.25 lakh people in the audience. Despite the Foreign Ministry’s statement to the contrary, the narrative will be political. Mr. Trump will pitch the Motera event as part of his election campaign back home. By choosing Gujarat as the venue, Mr. Modi too is scoring some political points with his home State. As they stand together, the two leaders, who have both been criticised in the last few months for not following democratic norms domestically, will hope to answer their critics with the message that they represent the world’s oldest democracy and the world’s largest one, respectively.
'''
#tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)
#tokenizing words
words = nltk.word_tokenize(paragraph)

Here, paragraph contains a particular newspaper article. Using sent_tokenize() and word_tokenize(), you can create lists of sentence and word tokens.

STEMMING: The process of reducing derived words to their word stem or root form.

Negligence  |
Negligently | ============> Negligen
Negligent   |

Here negligence, negligently, and negligent are all reduced to negligen. But stemming has a drawback: the stem may not be a meaningful word.

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

You need to create a PorterStemmer object and iterate through the sentence tokens to perform stemming.

LEMMATIZATION: A similar activity to stemming, but in this case the base word (the lemma) is a meaningful word.

from nltk.stem import WordNetLemmatizer

NOTE:

  1. Every time you perform lemmatization or stemming, you need to convert the tokens to lowercase and remove the stopwords.
from nltk.corpus import stopwords

  2. For more text-cleaning activity, if required, you can use the Python re module.

https://www.thehindu.com/opinion/editorial/

In the present discussion, the article contains ‘U.S.’ many times. Since this term is important to the article, you should convert ‘U.S.’ to ‘America’ using the replace() method. After performing the functions above, you can print the processed sentences to see the result.
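That substitution is a single string operation, applied before tokenizing (the sample text here is mine, not the full article):

```python
# Replace the abbreviation with a plain word so its periods don't
# interfere with tokenization and the token is meaningful on its own.
paragraph = "U.S. and India are expected to take forward military cooperation."
paragraph = paragraph.replace("U.S.", "America")
print(paragraph)  # America and India are expected to take forward military cooperation.
```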


In the stemmed output, ‘deliverables’ has clearly been reduced to ‘delive’, which has no meaning. So I performed lemmatization instead; in the lemmatized output, ‘deliverables’ remains a meaningful word.

Stop Words: nltk supports various languages, and each language has its own stopword list. For English, stopwords include words like ‘a’, ‘the’, and ‘where’.

Congratulations!!!

You have successfully lemmatized the sentence tokens!

BAG OF WORDS:

It is a model that represents the occurrence of words in a document or sentence, disregarding grammar and word order.

You can create an object of CountVectorizer to perform this action.

from sklearn.feature_extraction.text import CountVectorizer

We can draw two inferences from our discussion up to this point.

  1. In lemmatization and stemming, all the words have the same importance.
  2. No semantic information is present in the words we have collected.

To overcome these problems another approach is followed — TFIDF.

TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY:

What if some information is expressed in uncommon words, but those words carry great importance for text analysis?

No worries, TFIDF will help you out!

TF (Term Frequency) = (number of times a term appears in a document) / (total number of terms in the document)

IDF (Inverse Document Frequency) = log(total number of documents / number of documents containing the term)

TFIDF = TF * IDF

Again, thanks to the scikit-learn library for reducing some of your effort. Create an object of TfidfVectorizer to apply this concept.

from sklearn.feature_extraction.text import TfidfVectorizer
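A sketch on the same kind of toy corpus (mine, not the article): words shared by both documents, like ‘trade’ and ‘deal’, receive lower weights than words unique to one document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "india america sign trade deal",
    "trade deal talks continue",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
vocab = vectorizer.vocabulary_  # maps word -> column index

# 'india' appears only in doc 0, 'trade' in both, so 'india' gets more weight
assert X[0, vocab["india"]] > X[0, vocab["trade"]]

for word in sorted(vocab):
    print(f"{word:10s} doc0={X[0, vocab[word]]:.3f} doc1={X[1, vocab[word]]:.3f}")
```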

Have a glance at the results

These decimal outputs show the actual importance of each word with respect to the documents/sentences. This approach still captures neither semantics nor context, but it rewards terms that are frequent within a document while down-weighting terms that are also frequent across other documents.

Further, we can use topic modeling and analyze the article alongside other sources with the same context to understand the writers’ sentiments.

But those things will be discussed in part 2!!!

NLP is a vast and booming area of AI, though it is very challenging to work in. From smart assistants to disease detection, NLP is useful across many domains.

Congratulations and welcome to all of you in the Universe Of Natural Language Processing!!!

Thanks to Krish Naik, who made this journey possible.

For suggestions, I am available on LinkedIn and Gmail, and you can follow my work on GitHub.
