The Best NLP Tools of Early 2020: Live Demos

The easiest way to start using NLP in your projects

Igor Kaufman
Towards Data Science


2019 was the year of NLP. Cutting-edge models developed by Google, OpenAI, Facebook, and others became publicly available to a wider audience.

In this article, I’ve collected the best live demos of Natural Language Processing (NLP) and Natural Language Understanding (NLU) tools available on the market as open source or as a service, none of which require registration or coding skills. With these demos, I will give you a high-level overview of what has been achieved in natural language analysis as of early 2020.

The purpose of this article is not to dive deeply into aspects of technology or cover all the capabilities that these tools provide, but to gain clarity on what’s happening in the modern NLP world and better understand which achievements can be leveraged as practical tools out of the box. I personally use these demos to quickly validate ideas.

BERT — Understanding Texts

Demo: link.

BERT is a pre-trained language model published by Google, intended to help machines better understand what people search for.

Try feeding it a paragraph of text and asking it questions

Unlike older context-free approaches such as word2vec or GloVe, BERT takes the surroundings of a word — its context — into account, since the same word may have different connotations or meanings in different contexts. It was first published in 2018, and since December 2019 BERT has been officially used in Google search.
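If you want to reproduce the demo’s question answering in code, here’s a minimal sketch using the Hugging Face transformers library (my assumption; the demo itself may run a different stack):

```python
# A minimal sketch of extractive question answering with a BERT-style model.
# Assumes: pip install transformers (the default QA model downloads on first run).
from transformers import pipeline

qa = pipeline("question-answering")

context = (
    "BERT is a pre-trained language model published by Google. "
    "It was first published in 2018 and has been used in Google search."
)
print(qa(question="Who published BERT?", context=context)["answer"])  # -> "Google"
```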

GPT-2 — Creating Texts

Demo: link.

GPT-2 is a transformer model by OpenAI. It was first released in February 2019. Its main purpose is to predict the next word, given all of the previous words within a text. Until the end of 2019, only smaller, less coherent versions of GPT-2 had been published, due to fears that the full model would be used to spread fake news, spam, and disinformation. However, in November 2019, OpenAI said it had seen “no strong evidence of misuse” and released the model in full.

Now you can play with it, letting it complete the text you provide.

The bold text was written manually…
… the rest was generated by GPT-2
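To try the same text completion locally, a minimal sketch with the transformers library (again an assumption; the demo’s exact model size may differ) looks like this:

```python
# A minimal sketch of GPT-2 text completion.
# Assumes: pip install transformers (the "gpt2" weights download on first run).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# GPT-2 predicts the next word given all previous words, over and over.
outputs = generator("2019 was the year of NLP.", max_length=50, num_return_sequences=1)
print(outputs[0]["generated_text"])
```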

Another place to play with GPT-2, as well as XLNet, is Write With Transformer.

SpaCy — Implementing NLP in Production

Demo: link.

SpaCy is a free open-source NLP library developed by ExplosionAI. It’s aimed at helping developers in production tasks, and I personally love it. It also has nice visualization capabilities. Let’s take a look at some of its features.

Text Tokenization — in simple words, splitting a text into meaningful segments: words, punctuation marks, and so on. Later these segments, together or separately, can be vectorized in order to compare one to another (word embeddings). As a simple example, the words ‘cat’ and ‘fluffy’ are closer in the vector space than ‘cat’ and ‘spaceship’.
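Here’s a minimal sketch of both ideas in SpaCy (assuming the en_core_web_md model, which ships with word vectors, has been downloaded):

```python
# Tokenization and word-vector similarity in spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_md
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("The fluffy cat ignored the spaceship.")

# Tokenization: the text split into meaningful segments.
print([token.text for token in doc])

# Word embeddings: 'cat' should be closer to 'fluffy' than to 'spaceship'.
cat = nlp("cat")[0]
print(cat.similarity(nlp("fluffy")[0]), cat.similarity(nlp("spaceship")[0]))
```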

Here’s another example, trained on Reddit comments, that shows the closest concepts to the words you feed it: Sense2Vec.

Let’s take a look at the concepts close to ‘Kevin Spacey’. For 2015, we get other actors, but for 2019 we get primarily the #metoo context.

2015 vs 2019

Named Entity Recognition (NER) comes with a set of entities provided out of the box (persons, organizations, dates, locations, etc.). You can also train it with your own labels (e.g. addresses, counterparties, item numbers), whatever you want to extract from your documents.

Here you can try it on standard classes: displaCy Named Entity Visualizer.

Example of Named Entities Recognition
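Outside the visualizer, the same out-of-the-box entities take a few lines of SpaCy (a sketch, assuming the small English model is installed):

```python
# Named Entity Recognition with spaCy's pre-trained English model.
# Assumes: python -m spacy download en_core_web_sm
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in California in 1976.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Steve Jobs PERSON, ...

# displacy.serve(doc, style="ent")  # renders the same view as the demo
```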

Dependency Recognition — this helps you build rules when you need to extract connected structures from sentences. Demo: displaCy Dependency Visualizer.

Dependencies and their types
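The parse behind that visualization is directly accessible in code; a minimal sketch:

```python
# Dependency parsing with spaCy: each token points to its syntactic head.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text}")
```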

AllenNLP — A Famous Alternative

Demo: link.

Powerful for prototyping, with good text pre-processing capabilities. Less optimized for production tasks than SpaCy, but widely used for research, and ready for customization with PyTorch under the hood. Since it shares a lot of functionality with SpaCy, it’s more interesting to review the textual entailment demo here.

Textual Entailment takes a pair of sentences and predicts whether the facts in the first necessarily imply the facts in the second. It doesn’t always work well, but revealing these kinds of implications is one of the challenges of conversational AI.
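For the curious, AllenNLP exposes the same kind of model through its Predictor API. A sketch, assuming a pre-trained entailment model (the archive URL below is illustrative and may have moved):

```python
# Textual entailment with AllenNLP's Predictor API.
# Assumes: pip install allennlp (the model archive URL is illustrative).
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "decomposable-attention-elmo-2020.04.09.tar.gz"
)
result = predictor.predict(
    premise="Two women are wandering along the shore drinking iced tea.",
    hypothesis="Two women are outside.",
)
print(result["label_probs"])  # probabilities for entailment / contradiction / neutral
```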

Text Summarization — TL;DR

Demo: link.

Often, there’s so much information available that it’s important to distill it before processing. In general, there are two approaches: extraction-based and abstraction-based summarization. Extraction-based means keeping only the most valuable words/sentences that represent the content of the article.

This demo uses the extraction-based approach: the only thing you need to do is specify the number of sentences you want in the result. The abstraction-based approach, by contrast, lets the machine rephrase the text in a shorter form.
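If you’d rather do this in code than in a web demo, gensim 3.x ships a TextRank-style extractive summarizer (a sketch; note the module was removed in gensim 4.0):

```python
# Extraction-based summarization with gensim 3.x (TextRank-style sentence ranking).
# Assumes: pip install "gensim<4.0"
from gensim.summarization import summarize

text = open("article.txt").read()  # any long document with multiple sentences
print(summarize(text, ratio=0.1))  # keep roughly the top 10% of sentences
```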

My previous article was too long (BTW you can find it here), so I want its content shortened to 2 sentences. Here’s what I got:

For the past three years, I have led Machine Learning and Data Science at DataArt, researching the main points of different businesses, proposing technological solutions and carrying out implementation.

Cloud providers are rapidly developing ML services, treading the same path that Big Data services did before them.

Oh, well, let’s move on to the cloud providers :)

Other web demos for extractive summarization: Summarizr, Online Text Summary Generator — Free Automatic Text Summarization Tool, Text Analysis API Demo | AYLIEN

Google AutoML Natural Language

Demo: link.

Google is the company that arguably processes the largest quantities of textual data in the world. Google AutoML usually provides higher accuracy for custom Named Entity Recognition than the open-source tools do out of the box. It also has a convenient annotation UI, so you can get a jump-start. It’s equally good at other tasks, such as text classification and sentiment analysis.

Let’s take a look at the sentiment analysis example, scoring every sentence and the whole text in terms of positive or negative connotation.

The second sentence is positive, while the first one is negative
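The same analysis is available programmatically through the Cloud Natural Language client (a sketch, assuming the google-cloud-language package and configured GCP credentials; the client API has changed between versions):

```python
# Sentence- and document-level sentiment with Google Cloud Natural Language.
# Assumes: pip install google-cloud-language, plus GCP credentials in the environment.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The service was terrible. The food, however, was delightful.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.analyze_sentiment(document=document)
print(response.document_sentiment.score)  # -1.0 (negative) .. 1.0 (positive)
for sentence in response.sentences:
    print(sentence.text.content, sentence.sentiment.score)
```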

IBM Watson

Demo: link.

IBM is also a very strong competitor. In addition to what Google AutoML can do, IBM Watson lets you check the emotional characteristics of a text and the concepts it describes. Let’s try it on this dialogue:

“You know,” said Arthur, “it’s at times like this when I’m trapped in a Vogon airlock with a man from Betelgeuse, and about to die of asphyxiation in deep space that I really wish I’d listened to what my mother told me when I was young.”

“Why, what did she tell you?”

“I don’t know, I didn’t listen.”

Watson thinks it’s 72% sad, and the concepts also include “The Hitchhiker’s Guide to the Galaxy,” which is the origin of the text.
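The same emotions and concepts come back from the Natural Language Understanding API (a sketch, assuming the ibm-watson SDK and an IBM Cloud API key; the service URL depends on your region):

```python
# Emotion and concept analysis with IBM Watson NLU.
# Assumes: pip install ibm-watson; "YOUR_API_KEY" is a placeholder.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, EmotionOptions, ConceptsOptions)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

nlu = NaturalLanguageUnderstandingV1(
    version="2019-07-12",
    authenticator=IAMAuthenticator("YOUR_API_KEY"),
)
nlu.set_service_url(
    "https://api.us-south.natural-language-understanding.watson.cloud.ibm.com")

response = nlu.analyze(
    text="I really wish I'd listened to what my mother told me when I was young.",
    features=Features(emotion=EmotionOptions(), concepts=ConceptsOptions(limit=3)),
).get_result()
print(response["emotion"]["document"]["emotion"])  # e.g. {'sadness': 0.72, ...}
```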

Microsoft Azure Text Analytics also deserves a mention; the demo is here. The Amazon Comprehend demo requires registration and offers quite a limited set of features.

A couple of other things worth mentioning as well…

HuggingFace — Very Useful for Production

Demo: link.

HuggingFace makes different NLP models easy to use in production by fine-tuning them or wrapping them in easily pluggable libraries. In particular, its coreference resolution module based on AllenNLP is one of the most popular solutions on the market.

It’s very useful when you need to preprocess text to resolve its internal references: for example, which pronoun refers to which entity.
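Here’s a minimal sketch with HuggingFace’s neuralcoref extension for SpaCy (assuming the package installs cleanly against your SpaCy version):

```python
# Coreference resolution with HuggingFace's neuralcoref as a spaCy pipeline extension.
# Assumes: pip install neuralcoref (compatible with spaCy 2.x)
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("My sister has a dog. She loves him.")
print(doc._.coref_resolved)  # "My sister has a dog. My sister loves a dog."
```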

Berkeley Neural Parser

Demo: link.

The Berkeley parser annotates a sentence with its syntactic structure by decomposing it into nested sub-phrases.

This is convenient if you need to extract knowledge from sentences that can be described by some sort of template (e.g. noun phrases are something I’m looking for).
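A minimal sketch with the benepar package (assuming its English model has been downloaded once via benepar.download):

```python
# Constituency parsing with the Berkeley Neural Parser (benepar).
# Assumes: pip install benepar, then benepar.download("benepar_en2") once.
import benepar

parser = benepar.Parser("benepar_en2")
tree = parser.parse("The quick brown fox jumps over the lazy dog.")
print(tree)  # an NLTK-style tree of nested sub-phrases

# e.g. pull out every noun phrase ("NP") subtree:
for subtree in tree.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(subtree.leaves()))
```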

A few more things to take a look at are NLTK, Stanford CoreNLP, and TextRazor. Often these libraries are used for learning purposes and sometimes underpin the tools mentioned above.
