NLP Engine (Part-2) -> Best Text Processing Tools or Libraries for Natural Language Processing

Chethan Kumar GN
Towards Data Science
6 min read · Oct 5, 2018


NLP libraries — when, which, and where to use them: NLTK, TextBlob, Spacy, CoreNLP, Gensim, Polyglot.

One of the main questions that arises while building an NLP engine is “Which library should I use for text processing?”, since there are many on the market, along with “What is the need for the usage of NLP libraries?”. Both questions are addressed here, and the answers should help you take the right steps toward building an NLP engine from scratch on your own.

This article is a part of an ongoing series: Part-1

“What is the need for the usage of NLP libraries?”

Natural language processing (NLP) is getting very popular today. NLP is a field of artificial intelligence aimed at understanding and extracting important information from text and further training based on text data. The main tasks include speech recognition and generation, text analysis, sentiment analysis, machine translation, etc.

These libraries help us extract meaning from text, which covers a wide range of tasks such as document classification, topic modeling, part-of-speech (POS) tagging, and sentiment analysis.

“Which library should I use for text processing?”

Here is a list of the six most popular NLP libraries, used everywhere, in no particular order. It is up to you to select the one suitable for the task you are trying to perform.

If these six are mastered, there is no need to look through the tons of other NLP libraries out there. And even if you master just one of them, it is quite easy to switch between them to learn and use the others.

  1. NLTK (Natural Language Toolkit): One of the oldest libraries, used mostly for research and educational purposes.
  2. TextBlob: Built on top of NLTK, best for beginners. It is a user-friendly and intuitive interface to NLTK and is used for rapid prototyping.
  3. Spacy: The industrial standard right now and the best of the bunch.
  4. CoreNLP (Stanford CoreNLP): A production-ready solution built and maintained by the Stanford NLP group, but it is written in Java.
  5. Gensim: The package for topic modeling, vector space modeling, and document similarity.
  6. Polyglot: Usually used for projects involving a language spaCy doesn't support.

These are not the only libraries out there, but you could say that they are the “backbone of NLP”: mastering them will let you do any simple or advanced Natural Language Processing.

The best of the bunch right now is “Spacy”, and I cannot recommend it enough for NLP. If you are building anything that needs to be production-ready and up to industrial standards, please use Spacy; NLTK is better suited to academic purposes.

NLTK (Natural Language Toolkit)


NLTK is one of the oldest NLP libraries. If you are a beginner who needs to learn the basics of the NLP domain, then NLTK is for you. You can build appropriate models for whichever task you would like to achieve.

“Once NLTK has been mastered, it becomes a playground for text analytics researchers.” NLTK has over 50 corpora and lexicons, 9 stemmers, and dozens of algorithms to choose from. It is aimed primarily at academic researchers.
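
As a quick illustration, here is a minimal sketch of a few basic NLTK operations (tokenization, POS tagging, and stemming). It assumes NLTK is installed and that the standard tokenizer and tagger data packages have been downloaded:

```python
import nltk
from nltk.stem import PorterStemmer

# One-time downloads of the tokenizer and POS tagger data (skip after the first run).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "Natural language processing with NLTK is great for learning the basics."

tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tags = nltk.pos_tag(tokens)             # attach a part-of-speech tag to each token

stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]  # reduce each token to its stem

print(tokens)
print(tags)
print(stems)
```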

TextBlob


TextBlob is an interface for NLTK that turns text processing into a simple and quite enjoyable process: it has rich functionality and a smooth learning curve thanks to detailed and understandable documentation. It also allows the simple addition of various components, such as sentiment analyzers and other convenient tools, which is why it is used for “rapid prototyping”.
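
To show what that rapid prototyping feels like, here is a minimal sketch assuming TextBlob is installed and its corpora have been fetched with python -m textblob.download_corpora:

```python
from textblob import TextBlob

blob = TextBlob("TextBlob makes text processing simple and quite enjoyable.")

print(blob.words)          # tokenized words
print(blob.noun_phrases)   # detected noun phrases (needs the downloaded corpora)
print(blob.sentiment)      # Sentiment(polarity=..., subjectivity=...)
```

Tokenization, noun phrases, and sentiment all hang off the same TextBlob object, which is why it works so well for quick experiments.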

Spacy


Spacy is written in Python and Cython. Spacy does not provide over 50 variants of a solution for every task the way NLTK does: “Spacy provides only one, the best, solution for the task, thus removing the problem of choosing the optimal route yourself” and ensuring the models built are lean, mean, and efficient. In addition, the tool's functionality is already robust, and new features are added regularly.
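
Here is a minimal sketch of that “one solution” philosophy, assuming spaCy and its small English model (en_core_web_sm) are installed; a single call to nlp() runs the whole pipeline (tokenization, tagging, parsing, and named entity recognition):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Each token carries its POS tag and dependency label after one pipeline pass.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities are available on the same Doc object.
for ent in doc.ents:
    print(ent.text, ent.label_)
```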

CoreNLP(Stanford CoreNLP)


CoreNLP consists of a set of production-ready natural language analysis tools. It is written in Java, not Python, although there are Python wrappers made by the community. It is reliable, robust, faster than NLTK (though Spacy is much faster still), and it also supports multiple languages. Many organizations use CoreNLP for production implementations.

CoreNLP provides a great infrastructure for NLP tasks. However, the client-server architecture introduces some overhead that might be counterproductive for smaller projects or during prototyping.
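
As a rough sketch of that client-server workflow, assuming a CoreNLP server has already been started locally (for example with java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000), you can annotate text over HTTP without any particular wrapper library:

```python
import json
import requests

text = "Stanford University is located in California."

# Ask the server for tokenization, sentence splitting, POS tags, and NER as JSON.
properties = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
response = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(properties)},
    data=text.encode("utf-8"),
)

annotation = response.json()
for sentence in annotation["sentences"]:
    for token in sentence["tokens"]:
        print(token["word"], token["pos"], token["ner"])
```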

Gensim

Gensim is the package for topic modeling, vector space modeling, and document similarity. “Gensim is not for all types of tasks or challenges, but what it does do, it does well.” In the areas of topic modeling and document similarity comparison, the highly specialized Gensim library has no equal. So Gensim is not a general-purpose NLP library; its usage depends on the task at hand.
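
Here is a minimal sketch of Gensim's document similarity workflow, assuming Gensim is installed (the three toy documents below are just illustrative):

```python
from gensim import corpora, models, similarities

documents = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
    "Graph minors and trees a survey",
]

# Tokenize and map every word to an integer id.
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

tfidf = models.TfidfModel(corpus)                       # weight terms by TF-IDF
index = similarities.MatrixSimilarity(tfidf[corpus])    # similarity index over the corpus

query = dictionary.doc2bow("computer system survey".lower().split())
print(list(index[tfidf[query]]))   # similarity of the query to each document
```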

Polyglot


Polyglot is a slightly lesser-known library. It offers a broad range of analyses and impressive language coverage. Thanks to NumPy, it also works really fast. Using Polyglot is similar to using spaCy: it's very straightforward, and it will be an excellent choice for projects involving a language spaCy doesn't support. The library also stands out from the crowd because it exposes its pipelines through a dedicated command-line tool. Definitely worth a try.
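
A minimal sketch of Polyglot in action, assuming the library and its system dependencies are installed and the models for the relevant language have been fetched with its command-line downloader (for example polyglot download embeddings2.en ner2.en):

```python
from polyglot.text import Text

text = Text("Polyglot detects the language of this sentence automatically.")

print(text.language.code)   # detected language code, e.g. 'en'
print(text.words)           # tokenized words

# Named entities require the downloaded NER model for the detected language.
for entity in text.entities:
    print(entity.tag, entity)
```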

Conclusion:

NLTK is more academic. You can use it to try different methods and algorithms, combine them, etc. Spacy, instead, provides one out-of-the-box solution for each problem. You don't have to think about which method is better: the authors of Spacy have already taken care of that. Also, Spacy is very fast (several times faster than NLTK). One downside is the limited number of languages Spacy supports; however, the number of supported languages is increasing consistently. So, we think that Spacy would be an optimal choice in most cases, but if you want to try something special you can use NLTK.

You may not have understood the meaning of tokenization, topic modeling, intents, etc. here; I will cover them in my next post, NLP Engine (Part-3).

Credits:

  1. https://www.kdnuggets.com/2018/07/comparison-top-6-python-nlp-libraries.html
  2. https://elitedatascience.com/python-nlp-libraries
  3. https://sunscrapers.com/blog/6-best-python-natural-language-processing-nlp-libraries/

Make sure to follow me on Medium, LinkedIn, Twitter, and Instagram to get more updates. And if you liked this article, make sure to give it a clap and share it.

Join our WhatsApp community here.
