Top 5 Natural Language Processing Python Libraries for Data Scientists

A complete overview of popular Python libraries for Natural Language Processing, in a non-verbose manner.

Sri Manikanta Palakollu
Analytics Vidhya


More than 70 percent of the data available on the internet is not in a structured format. Since data is the lifeblood of data science, researchers have worked hard to push the limits from structured data processing to unstructured data processing.

Unstructured data includes sensor data, images, video files, audio files, website and API data, social media data, emails, and much more text-related information.

Because of its nature, we cannot process this data in a simple way for use in an application. To solve this problem, many techniques and tools have come into existence in the big data and data science world.

One of the advanced technologies to emerge from the artificial intelligence world is natural language processing. It seeks to understand the meaning and context of text and human speech, increasingly with the aid of deep learning algorithms that use neural networks to analyze the data.

Natural Language Processing [NLP]

It is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable. Most NLP techniques rely on machine learning to derive meaning from human languages.

Famous applications built on NLP technology include —

  1. Google Speech Recognition.
  2. Apple’s Siri.
  3. Microsoft’s Cortana.
  4. Amazon’s Alexa.
  5. OK Google on Android phones.
  6. Interactive voice responses used in call centers.
  7. Grammarly [a famous spell checker used by many authors and organizations].

The applications are endless, and the technology is advancing rapidly through ongoing research and development. New application-based systems keep coming out that use the core features of NLP to solve our real-world problems.

Let’s look at the most widely used libraries available in Python for building our own NLP models —

spaCy

spaCy is an extremely optimized NLP library that is meant to be used together with deep learning frameworks such as TensorFlow or PyTorch.


It is an advanced NLP library available in Python and Cython that enables very rapid application development. spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 50+ languages. It features state-of-the-art speed, convolutional neural network models for tagging, parsing, and named entity recognition, and easy deep learning integration.

Implementation Example

# Importing the library
import spacy

# Load the English model (on newer spaCy versions, use spacy.load('en_core_web_sm'))
nlp = spacy.load('en')
doc = nlp('Earth Revolves around the sun.')

# Iterating over the tokens
for token in doc:
    print(token.text, token.pos_)

Explanation and Output:

In the above example, we implemented a simple NLP program that takes text as input and produces the part of speech [POS] for each word:

>>> Earth NOUN

>>> Revolves VERB

and the list goes on. It will generate the parts of speech for each word in the given text.
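Since spaCy also ships convolutional neural network models for named entity recognition, here is a minimal sketch of NER with spaCy, assuming the small English model has been downloaded (the sample sentence is illustrative):

# A minimal sketch of named entity recognition with spaCy
# (assumes: python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Google was founded in California in 1998.')

# Each recognized entity carries its text span and a label
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Google ORG, California GPE, 1998 DATE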

Gensim

Gensim is a Python library for topic modeling, document indexing and similarity retrieval with large corpora.

The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Features of Gensim are —

  1. All algorithms are memory-independent with respect to the corpus size.
  2. Efficient implementations of popular algorithms.
  3. Distributed computing: latent semantic analysis and latent Dirichlet allocation can run on a cluster of computers.
  4. Intuitive interfaces.

Implementation Example

# Importing the library
import gensim

# Load a pre-built dictionary and a vectorized (TF-IDF) corpus from a Wikipedia dump
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')

# Train an LDA topic model with 10 topics
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10, update_every=1, chunksize=10000, passes=1)

lda.print_topics(1)

Explanation:

Above you can see a very simple example of extracting topics from a Wikipedia dump via LDA. A disadvantage of this library is that it is not easy to figure out how to load or build vectorized corpora and dictionaries from (plain) text data; one possible approach is sketched below.
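A minimal sketch of building a dictionary and bag-of-words corpus from plain text with gensim’s corpora module (the toy documents and the naive tokenization are illustrative assumptions):

# A minimal sketch: building a dictionary and BOW corpus from plain text
from gensim import corpora

# Toy documents (illustrative only)
documents = ['Earth revolves around the sun.',
             'The sun is a star.']

# Naive whitespace tokenization; a real pipeline would also
# remove stop words and punctuation
texts = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(texts)                  # maps each word to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors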

Pattern

Pattern is a data mining library for Python that is used to crawl and parse a variety of sources such as Google, Twitter, Wikipedia, and many more.

It comes with various NLP tools (PoS tagging, n-grams, sentiment analysis, WordNet), machine learning capabilities (vector space models, clustering, classification), and various tools for conducting network analysis. It is maintained by CLiPS, so there is not only good documentation and many examples, but also a lot of academic publications that make use of the library.

Implementation Example

# Importing libraries
from pattern.web import Google
from pattern.en import ngrams

# API_LICENSE is a placeholder for your own Google API license key
engine = Google(license=API_LICENSE)

for result in engine.search('NLP', cached=False):
    print(result.url, result.text)
    print(ngrams(result.text, n=2))

Explanation

In the above example, we’re crawling Google for results containing the keyword ‘NLP’. It prints all of the result URLs and their text, along with the bigrams for each result. While this is really a fairly simple example, it shows how easily crawling and NLP tasks can be performed in unison with Pattern. The NLP helpers also work on their own, without any crawling, as sketched below.
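A minimal sketch of Pattern’s pattern.en helpers used stand-alone (the example sentences are illustrative):

# A minimal sketch of pattern.en used without any web crawling
from pattern.en import parse, sentiment

# PoS-tagged, chunked output for a sample sentence
print(parse('Earth revolves around the sun.'))

# sentiment() returns a (polarity, subjectivity) tuple
print(sentiment('Pattern is a surprisingly pleasant library.'))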

Natural Language Toolkit [NLTK]

It is one of the greatest libraries available for training NLP models. This library is very easy to use and beginner-friendly. It has a lot of pre-trained models and corpora, which help us analyze things very easily.

Implementation Example

# Importing the library
import nltk
# nltk.download('punkt')  # the tokenizer models need to be downloaded once

text = 'Earth Revolves around the Sun.'

# Token generator --> separates the sentence into tokens
tokens = nltk.word_tokenize(text)
for token in tokens:
    print(token)

Explanation

word_tokenize() breaks a piece of text up into simple words; each of these we call a token.

Output

Earth
Revolves
around
the
Sun
.
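NLTK can also tag those tokens with their parts of speech, much like the spaCy example above. A minimal sketch, assuming the tagger resource has been downloaded:

# A minimal sketch of part-of-speech tagging with NLTK
import nltk
# nltk.download('averaged_perceptron_tagger')  # required once

tokens = nltk.word_tokenize('Earth Revolves around the Sun.')
print(nltk.pos_tag(tokens))  # a list of (word, tag) pairs, e.g. ('Earth', 'NNP')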

TextBlob

TextBlob is based on both Pattern and NLTK and provides a great API for all the common NLP operations.

While it isn’t the fastest or most complete library, it offers everything one needs on a day-to-day basis in an extremely accessible and manageable way.

Implementation Example

# Importing the module
from textblob import TextBlob

text = 'Earth Revolves around the sun.'

text_blob = TextBlob(text)

for sentence in text_blob.sentences:
    print(sentence.sentiment.polarity)

Explanation

In the above example, we consider the same text used in the NLTK example. TextBlob first splits the text into sentences, then performs sentiment analysis on each one, which here means computing the sentence polarity (a score between -1.0 for negative and 1.0 for positive). TextBlob packs several other everyday operations into the same object, as sketched below.
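A minimal sketch of a few of those conveniences (the misspelled sample sentence is illustrative, and the tagger relies on NLTK corpora being installed):

# A minimal sketch of TextBlob's day-to-day conveniences beyond sentiment
from textblob import TextBlob

blob = TextBlob('Earth Revovles around the sun.')

print(blob.correct())  # attempts spelling correction on the text
print(blob.words)      # tokenized words
print(blob.tags)       # (word, part-of-speech) pairs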

Conclusion

I hope you got a complete picture of the popular libraries available in Python for natural language processing. Since NLP is a difficult subject, beginners may not grasp everything at once, but any of the above libraries is a good place to start.

If you are new to natural language processing, I would highly recommend starting with NLTK.

