Visualizing what docs are really about with expert.ai

A brief tutorial on how to retrieve and visualize hidden relevant data in textual documents, using the NL API.

Photo by Patrick Fore on Unsplash

If you have ever worked with text datasets, you know how difficult it is to retrieve the central information in a dataset while avoiding noise and redundancy. Many approaches give good results if you are willing to train your own model, but that requires time and resources. In this tutorial we will explore how to retrieve hidden information in texts using the NL API Python SDK, which requires no model training, and then visualize the results in graphs that can be added to a report.

What is NL API?

The Natural Language API is a service developed by expert.ai that can be used to easily build NLP applications. It includes an NLP pipeline (tokenization, lemmatization, PoS tagging, dependency parsing, syntactic parsing, word-sense disambiguation) along with a ready-to-use IPTC Media Topics classifier, a geographic taxonomy and sentiment analysis. Named entity recognition (NER) is also performed, and entities are linked to their unique identifiers in public knowledge bases (e.g., Wikidata, GeoNames, DBpedia). The NL API is easy to use and offers a wide range of functionalities, which makes it a useful tool for anyone interested in NLP. The company behind it has more than 20 years of experience delivering linguistic solutions, so the API is a good fit for anyone interested in building a business-ready solution.

Step 1: Install the NL API Python SDK

In this tutorial, we will use the Python client for the Natural Language API to add natural language understanding capabilities to Python apps. You can use pip to install the library:

$ pip install expertai-nlapi

Set up

To use the NL API you need to create your credentials on the Developer Portal. The Python client code expects developer account credentials to be available as environment variables:

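  • Linux:
      export EAI_USERNAME=YOUR_USER
      export EAI_PASSWORD=YOUR_PASSWORD
  • Windows:
      SET EAI_USERNAME=YOUR_USER
      SET EAI_PASSWORD=YOUR_PASSWORD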

You can also define them inside your code:

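import os
# Replace the placeholders with your Developer Portal credentials
os.environ["EAI_USERNAME"] = "YOUR_USER"
os.environ["EAI_PASSWORD"] = "YOUR_PASSWORD"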

Step 2: Key elements

Now we can use the out-of-the-box features on a collection of documents to retrieve their key elements. First things first: import the library and initialize the client object.
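
The original snippet for this step is not reproduced here, so the following is a minimal sketch. It assumes the import path `expertai.nlapi.cloud.client` and the class `ExpertAiClient` as published in the SDK's quick start; both may differ between SDK versions. The helper name `make_client` is hypothetical.

```python
def make_client():
    """Create an NL API client.

    Assumption: the SDK reads EAI_USERNAME and EAI_PASSWORD from the
    environment when the client is used.
    """
    # Imported lazily so this module still loads where the SDK is not installed.
    from expertai.nlapi.cloud.client import ExpertAiClient

    return ExpertAiClient()


# Usage (needs valid credentials and network access):
# client = make_client()
```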

Then, set the text and the language:
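
The article's own sample string is not shown, so the text below is a stand-in chosen to be consistent with the concepts listed later (a startup, Facebook, Springfield). Treat it as an assumption, not as the original input.

```python
# Stand-in sample text (an assumption; the article's own string is not shown).
text = (
    "Facebook is looking at buying an American startup "
    "for $6 million based in Springfield, IL."
)

# Two-letter code of the language the text is written in.
language = "en"
```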

We can now perform our first analysis and look at the document’s content in a nutshell. Key elements are obtained with the relevants analysis and are identified within the document as main sentences, main concepts (called "syncons"), main lemmas and relevant topics:
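
A sketch of the request, under stated assumptions: the method `specific_resource_analysis`, the `relevants` resource name, and the response attributes (`main_sentences`, `main_syncons`, `main_lemmas`, `topics` and their `value`, `lemma`, `label`, `score` fields) follow the SDK's published examples and may vary by version. The helper `key_elements` is a hypothetical name.

```python
def key_elements(client, text, language="en"):
    """Run the 'relevants' analysis and return the key elements as plain lists."""
    output = client.specific_resource_analysis(
        body={"document": {"text": text}},
        params={"language": language, "resource": "relevants"},
    )
    # Each entry is a (label, relevance score) pair.
    return {
        "sentences": [(s.value, s.score) for s in output.main_sentences],
        "concepts": [(c.lemma, c.score) for c in output.main_syncons],
        "lemmas": [(lemma.value, lemma.score) for lemma in output.main_lemmas],
        "topics": [(t.label, t.score) for t in output.topics],
    }
```

With a real client, `key_elements(client, text, language)["concepts"]` would give the concept and score pairs discussed next.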

The engine performs word-sense disambiguation thanks to a proprietary knowledge graph that stores all concepts for each language. This helps the engine understand which concepts appear in a document. Notice how concepts and lemmas differ: a concept may be represented by one or more lemmas, which are synonyms. For example, if we find home and house in a text, the technology understands that the two lemmas have the same meaning and assigns both of them the preferred label for the concept, which in this case would be house. Let’s take a look at the most relevant concepts found in the previous text, together with their relevance scores:

CONCEPT              SCORE 

startup company      42.59
Facebook Inc.        41.09
Springfield          14.89
American               1.1

We can do the same for a collection of documents, retrieving relevant concepts for each document and aggregating them:
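
One possible way to aggregate, sketched under the same API assumptions as above (`specific_resource_analysis`, `main_syncons`, `lemma`). The helper `aggregate_concepts` is hypothetical, and counting each concept once per document is a design choice of this sketch, not necessarily the author's.

```python
from collections import Counter


def aggregate_concepts(client, documents, language="en"):
    """Count in how many documents each main concept appears."""
    counts = Counter()
    for text in documents:
        output = client.specific_resource_analysis(
            body={"document": {"text": text}},
            params={"language": language, "resource": "relevants"},
        )
        # A set avoids counting the same concept twice within one document.
        counts.update({concept.lemma for concept in output.main_syncons})
    return counts
```

Counting documents rather than summing relevance scores keeps long documents from dominating the totals.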

We can then visualize the results in a bar graph, selecting the 20 most common concepts in our dataset:
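
A minimal plotting sketch using matplotlib, which is an assumption, since the article does not say which plotting library was used. The helper names `top_concepts` and `plot_top_concepts`, and the output file name, are hypothetical.

```python
def top_concepts(counts, n=20):
    """Return (labels, values) for the n most frequent concepts, most frequent first."""
    # Ties are broken alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
    labels = [label for label, _ in ranked]
    values = [value for _, value in ranked]
    return labels, values


def plot_top_concepts(counts, n=20, path="top_concepts.png"):
    """Save a horizontal bar chart of the n most frequent concepts."""
    # Imported lazily so the ranking helper works without matplotlib installed.
    import matplotlib.pyplot as plt

    labels, values = top_concepts(counts, n)
    fig, ax = plt.subplots(figsize=(8, 6))
    # Reverse so the most frequent concept ends up at the top of the chart.
    ax.barh(labels[::-1], values[::-1])
    ax.set_xlabel("Number of documents")
    ax.set_title(f"{len(labels)} most common concepts")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```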

Bar chart of 20 most common concepts in documents

Step 3: Document classification

Now, we will learn how to classify documents according to the IPTC Media Topics Taxonomy provided by the Natural Language API. We start with a sample text first to see how to set everything up, and then perform the same action on a dataset.

We can now request to classify the previous text based on the IPTC Media Topics categories:
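
A sketch of the classification request. The `classification` method, the `iptc` taxonomy name, and the category attributes `label`, `id_` and `frequency` are assumed from the SDK's published examples; `classify_iptc` is a hypothetical helper name.

```python
def classify_iptc(client, text, language="en"):
    """Classify a text against IPTC Media Topics.

    Returns a list of (label, IPTC id, frequency) tuples.
    """
    output = client.classification(
        body={"document": {"text": text}},
        params={"taxonomy": "iptc", "language": language},
    )
    return [(c.label, c.id_, c.frequency) for c in output.categories]
```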

For each category we can see its label, its unique ID and its frequency within the text:

CATEGORY                    IPTC ID    FREQUENCY

Earnings                    20000178     29.63
Social networking           20000769     21.95

With this new functionality on our dataset, we can retrieve and collect all the IPTC Media Topics categories triggered in the documents:
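
A sketch of the collection step, under the same API assumptions as the single-text classification (`classification`, `categories`, `label`). The helper name `collect_categories` is hypothetical; the variable name `collection_of_categories` is the one the article uses.

```python
from collections import Counter


def collect_categories(client, documents, language="en"):
    """Map each IPTC category label to the number of documents that triggered it."""
    collection_of_categories = Counter()
    for text in documents:
        output = client.classification(
            body={"document": {"text": text}},
            params={"taxonomy": "iptc", "language": language},
        )
        # Count each category at most once per document.
        collection_of_categories.update({c.label for c in output.categories})
    return dict(collection_of_categories)
```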

Now, in collection_of_categories we have each IPTC Media Topics category found and the number of documents in which it was found. We can now visualize a tag cloud of the topics of the dataset:
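
One way to draw the tag cloud is the third-party `wordcloud` package, which is an assumption here because the article does not name the library it used. The helper `build_tag_cloud` and the output file name are hypothetical.

```python
def build_tag_cloud(collection_of_categories, path="iptc_tag_cloud.png"):
    """Render category counts as a tag cloud image and return the file path."""
    if not collection_of_categories:
        raise ValueError("no categories to draw")
    # Imported lazily so the input check works without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    # Font size is driven by the document count of each category.
    cloud.generate_from_frequencies(collection_of_categories)
    cloud.to_file(path)
    return path
```

Passing frequencies directly keeps multi-word labels such as "Social networking" together, which plain text tokenization would split.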

IPTC Media Topic classes word cloud

The whole code is available here.


With this technology, it’s easy to access hidden information in texts, such as the most important concepts or the topics they describe. As we have seen, going from installation to the visualization of meaningful data takes only a few minutes. We hope that you enjoyed this tutorial and look forward to learning about the textual exploration you perform using the NL API.
