
# Alternative distributional semantics approach

Resolving ambiguity is not ambiguous anymore…


If you landed here, it means that you’re curious enough to learn more about the different ways to resolve ambiguity in NLP/NLU.

A lack of background information is one of the main reasons language is ambiguous for machines. This ambiguity arises from the natural language humans use to communicate. The process of "translating" this language into a formal representation that machines can work with can itself produce ambiguity, because human language is inherently informal and ambiguous.

Traditional distributional semantics approaches rely on word vectorization to capture meaning. The alternative shown here relies on a knowledge graph, queried with straightforward requests, to resolve lexical ambiguity.

This tutorial highlights several ambiguity-resolution tasks that we can use to tackle this problem through an easy and convenient tool: a Natural Language API (NL API).

Photo by Paweł Czerwiński on Unsplash

Machines cannot natively interpret or understand text; to do so, and to resolve language ambiguity, they need the text to be annotated through multi-level linguistic analysis. The process that handles ambiguity is called "disambiguation": it helps the machine detect the meaning (semantics) in a text, determined by considering context, syntax and the relations between words.


The following article covers different approaches that help machines reduce ambiguity and improve text comprehension, such as lemmatization, POS tagging and so on.

The work will be based on the use of a Natural Language API called expert.ai NL API.

## Expert.ai NL API?

The expert.ai Natural Language API is a service that provides multiple levels of information about a text through a few lines of code. The API delivers deep language understanding for building NLP modules. It exposes a set of features that perform deep linguistic analysis (tokenization, lemmatization, PoS tagging, morphological/syntactic/semantic analysis). On top of that, the library tackles problems such as Named Entity Recognition (NER), detection of semantic relationships between entities, and sentiment analysis. Document classification is also available through a ready-to-use taxonomy.

## 1/ How to use the expert.ai NL API for Python?

### Installation of the library

First things first, you need to install the client library using this command:

pip install expertai-nlapi

The API is available once you have created your credentials on the developer.expert.ai portal. The Python client code expects your developer account credentials to be specified as environment variables:

  • Linux:

export EAI_USERNAME=YOUR_USER
export EAI_PASSWORD=YOUR_PASSWORD

  • Windows:

SET EAI_USERNAME=YOUR_USER
SET EAI_PASSWORD=YOUR_PASSWORD

YOUR_USER is the email address you specified during registration. You can also define credentials inside your code:
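For instance, a minimal sketch (the client reads these environment variables when it is created, so setting them from code before instantiating the client has the same effect; the values below are placeholders):

```python
import os

# Placeholder credentials: substitute your registration email and password
os.environ["EAI_USERNAME"] = "your.email@example.com"
os.environ["EAI_PASSWORD"] = "your_password"
```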

## 2/ Deep linguistic analysis

Linguistics separates the analysis of language into different branches; all of these branches are interdependent, since everything in language is linked. The document is processed with multi-level text analysis: each text is split into sentences, which are parsed into tokens, lemmas and parts of speech; the relations between syntactic constituents and predicates are identified; and the syntax is interpreted to build a full dependency tree.

To retrieve these pieces of information, you start by importing the client section of the library:
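For example (the import path below is the one used by recent versions of the expertai-nlapi package; adjust it if your version differs):

```python
from expertai.nlapi.cloud.client import ExpertAiClient

# The client picks up EAI_USERNAME and EAI_PASSWORD from the environment
client = ExpertAiClient()
```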

Let’s take an example to illustrate these operations:

"Sophia is a social humanoid robot developed by Hong Kong-based company Hanson Robotics. Sophia was activated on February 14, 2016."

### a/ Text Subdivision

This operation divides the text from the longest form down to the smallest: starting at the paragraph level, going through sentences and phrases, down to the token level. When a token is a collocation (compound word), the subdivision can go deeper, down to the atom level, which cannot be divided any further.

Once you have imported the library and instantiated the client, you should set the language of the text and the parameters of the API:
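A minimal sketch of such a request (the body/params shape follows the pattern used by the Python client; the text is our Sophia example):

```python
text = ("Sophia is a social humanoid robot developed by Hong Kong-based "
        "company Hanson Robotics. Sophia was activated on February 14, 2016.")
language = 'en'

# Request a full multi-level analysis (disambiguation) of the text
output = client.specific_resource_analysis(
    body={"document": {"text": text}},
    params={'language': language, 'resource': 'disambiguation'})
```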

Inside the API request, the sentence to analyze goes in the body and the language in the params. The resource parameter selects the operation to perform on the text; here it is disambiguation, which is based on the multi-level text analysis provided by the expert.ai NL API.

This multi-level text analysis is generally broken down into three stages:

  1. A lexical analysis: a text subdivision phase that breaks the text down into elementary entities (tokens).
  2. A syntactic analysis: the recognition of combinations of lexemes that form syntactic entities (including PoS tagging).
  3. A semantic analysis: disambiguation occurs at this level; it detects the meanings of these entities according to the communicative context and the possible relationships between them.

The lexical analysis starts with this first subdivision:
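A short loop over the paragraphs (assuming, as in the client's data model, that each element carries start/end character offsets into the original text):

```python
# Print each paragraph by slicing the original text with its offsets
for paragraph in output.paragraphs:
    print("Paragraphs:", text[paragraph.start:paragraph.end])
```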

Paragraphs: Sophia is a social humanoid robot developed by Hong Kong-based company Hanson Robotics. Sophia was activated on February 14, 2016.

Since our text is already a single paragraph (of two sentences), the output of the subdivision is the same as the input. Let's break the paragraph down to the sentence level; we just need to change the element .paragraphs to .sentences. The most common way of delimiting a sentence is the dot (.):
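The same loop, with the attribute swapped (same offset assumption as above):

```python
# One level down in the subdivision: sentences
for sentence in output.sentences:
    print("Sentences:", text[sentence.start:sentence.end])
```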

Sentences: Sophia is a social humanoid robot developed by Hong Kong-based company Hanson Robotics.
Sentences: Sophia was activated on February 14, 2016.

We do indeed get two sentences. Let's go deeper into the subdivision to retrieve the phrase level, using the same procedure as above and replacing the element .sentences with .phrases:
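Again the same pattern (same offset assumption):

```python
# One more level down: phrases
for phrase in output.phrases:
    print("Phrases:", text[phrase.start:phrase.end])
```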

Phrases: Sophia
Phrases: is a social humanoid robot
Phrases: developed
Phrases: by Hong Kong-based company Hanson Robotics
Phrases: .
Phrases: Sophia
Phrases: was activated
Phrases: on February 14, 2016
Phrases: .

We notice that the deeper we go into the subdivision, the more elements the result contains. We can also get the number of phrases in our text:
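For instance (assuming output.phrases is a plain list):

```python
print("phrases array size: ", len(output.phrases))
```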

phrases array size:  10

### b/ Tokenization

Furthermore, we can break the phrase level down into smaller units: tokens. This task, called "tokenization", is very common in NLP and helps the machine understand the text. A rough tokenization can be performed in plain Python with the .split() method, as shown below:

For example, consider this sentence:

"CNBC has commented on Sophia’s lifelike skin and her ability to emulate more than 60 facial expressions."

These are the tokens of the sentence ['CNBC', 'has', 'commented', 'on', 'the', "robot's", 'lifelike', 'skin', 'and', 'her', 'ability', 'to', 'emulate', 'more', 'than', '60', 'facial', 'expressions.']

Without specifying a delimiter inside split(), the text is separated on whitespace. With the expert.ai NL API, we can perform tokenization as well, with additional features. In other words, the API provides different word-level token analyses: the tokens returned by the API can be words, contractions (such as ’s) and even punctuation. Let's see how to perform this task with the API, using the same procedure as above and replacing the element .phrases with .tokens:
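A sketch of the token loop, run on the CNBC sentence (same request shape as before; tokens are assumed to carry start/end offsets, as above):

```python
output = client.specific_resource_analysis(
    body={"document": {"text": sentence}},
    params={'language': 'en', 'resource': 'disambiguation'})

print(f'{"TOKEN":20}')
print(f'{"----":20}')
for token in output.tokens:
    print(f'{sentence[token.start:token.end]:20}')
```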

TOKEN                
----                
CNBC                
has                 
commented           
on                  
the                 
robot               
's                  
lifelike            
skin                
and                 
her                 
ability             
to                  
emulate             
more                
than                
60                  
facial expressions  
.

We notice that the tokens are either words like skin, ability, emulate, contractions such as ’s, numbers such as 60, or even punctuation like the dot (.).

The tokenization also yields collocations, such as facial expressions, which is impossible with the split() function. The API can detect compound words inside the sentence, based on the positions of the words and the context. A collocation can be divided further, down to the atom level, the smallest lexical unit available:
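A sketch of the deeper loop (assuming compound tokens expose their parts as token.atoms, again with start/end offsets):

```python
for token in output.tokens:
    print(f'{sentence[token.start:token.end]:20}')
    # Only collocations carry atoms; other tokens have none
    for atom in token.atoms or []:
        print(f'     atom: {sentence[atom.start:atom.end]}')
```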

CNBC                
has                 
commented           
on                  
the                 
robot               
's                  
lifelike            
skin                
and                 
her                 
ability             
to                  
emulate             
more                
than                
60                  
facial expressions  
     atom: facial              
     atom: expressions         
.

### c/ PoS Tagging

Tokenization leads to the second process in NLP: POS tagging (Part-of-Speech tagging); the two work together to allow the machine to detect the meaning of the text. At this stage we introduce the syntactic analysis, which includes the POS tagging task. It consists in assigning a POS, or grammatical class, to each token. The POS characterizes the morpho-syntactic nature of each token, and the labels attributed to the textual elements can reflect part of the meaning of the text. A few commonly used parts of speech in the English language: DETERMINER, NOUN, ADVERB, VERB, ADJECTIVE, PREPOSITION, CONJUNCTION, PRONOUN, INTERJECTION.

A word that shares its form with other words (a homograph) can have different meanings (polysemy). Such a word can have different POS tags even though its form stays the same: the grammatical class depends on the position of the word in the text and on its context. Let's consider these two sentences:

"The object of this exercise is to raise money for the charity."
"A lot of people will object to the book."

From a linguistic point of view, in the first sentence object is a noun, whilst in the second one object is a verb. PoS tagging is a crucial step towards disambiguation. Based on this tagging, the meaning of a word is inferred from the context, from the form of the word (for instance, the capital letter at the beginning of a proper noun), from its position (SVO word order), etc. Semantic relationships are then produced between the words, linking each concept to the others depending on the type of relationship and together building a knowledge graph. Let's use the API to generate the POS tagging for the previous two sentences:

We start by importing the library and creating the client, as below:
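As earlier (import path per the expertai-nlapi client):

```python
from expertai.nlapi.cloud.client import ExpertAiClient

client = ExpertAiClient()
```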

Then we declare a variable for each sentence: object_noun for the sentence where the word object is a noun, and object_verb for the sentence where it is a verb:
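The two example sentences as plain strings:

```python
object_noun = "The object of this exercise is to raise money for the charity."
object_verb = "A lot of people will object to the book."
```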

The word object has the same form in both sentences but a different POS. To demonstrate this with the expert.ai NL API, we need to call it.

First, we specify the text on which to perform the POS tagging: object_noun for the first sentence, object_verb for the second. Then we set the language of the examples and, finally, the resource, which selects the analysis to perform, in this case disambiguation. Once these parameters are set, we iterate over the tokens to read off the POS assigned to each one, for each example.
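A minimal sketch of the full call (same request shape as before; token.pos is assumed to hold the part-of-speech label shown in the outputs below):

```python
# Analyze both example sentences and print each token with its POS tag
for text in (object_noun, object_verb):
    output = client.specific_resource_analysis(
        body={"document": {"text": text}},
        params={'language': 'en', 'resource': 'disambiguation'})

    print(f'{"TOKEN":20}{"POS":6}')
    for token in output.tokens:
        print(f'{text[token.start:token.end]:20}{token.pos:6}')
```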

Output of the first sentence:

TOKEN                POS   
The                  DET   
object               NOUN  
of                   ADP   
this                 DET   
exercise             NOUN  
is                   VERB  
to                   PART  
raise                VERB  
money                NOUN  
for                  ADP   
the                  DET   
charity              NOUN  
.                    PUNCT 
Output of the second sentence:

TOKEN                POS   
A lot of             ADJ   
people               NOUN  
will                 AUX   
object               VERB  
to                   ADP   
the                  DET   
book                 NOUN  
.                    PUNCT

On the one hand, object is indeed a NOUN, preceded by the determiner (DET) The. On the other hand, the word object is in fact the VERB of the sentence, linking the subject "a lot of people" to its object "the book".

Traditional POS tagging tools in NLP usually rely on the same types of information to label a word in a text: its context and its morphology. The distinctive feature of POS tagging within the expert.ai NL API is that it not only assigns a grammatical label to each token but also introduces its meaning.

In other words, one word can share its form with other words while carrying several meanings (polysemy). Each meaning is conveyed by a concept, linked to other concepts, together forming a knowledge graph. The word object seen above has more than one meaning; hence it belongs to different semantic concepts, called "syncons" in the knowledge graph of the expert.ai NL API. POS tagging can reveal different labels, and thus different meanings, for the same word. That is what we can examine with the API:
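A sketch of the same loop extended with the concept ID (assuming token.syncon holds the ID of the concept in the knowledge graph, with -1 when no concept applies):

```python
for text in (object_noun, object_verb):
    output = client.specific_resource_analysis(
        body={"document": {"text": text}},
        params={'language': 'en', 'resource': 'disambiguation'})

    print(f'{"TOKEN":20}{"POS":12}{"ID":>6}')
    for token in output.tokens:
        print(f'{text[token.start:token.end]:20}{token.pos:12}{token.syncon:6}')
```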

        Concept_ID for object when NOUN  

TOKEN                POS             ID    
The                  DET                 -1 
object               NOUN             26946 
of                   ADP                 -1 
this                 DET                 -1 
exercise             NOUN             32738 
is                   VERB             64155 
to                   PART                -1 
raise                VERB             63426 
money                NOUN             54994 
for                  ADP                 -1 
the                  DET                 -1 
charity              NOUN              4217 
.                    PUNCT               -1 
     Concept_ID for object when VERB  

TOKEN                POS             ID    
A lot of             ADJ              83474 
people               NOUN             35459 
will                 AUX                 -1 
object               VERB             65789 
to                   ADP                 -1 
the                  DET                 -1 
book                 NOUN             13210 
.                    PUNCT               -1

As can be noted, the NOUN object belongs to the concept with the ID 26946. This concept includes other words with the same meaning (synonyms). By contrast, its homograph in the second sentence is related to the ID 65789. These IDs identify each concept inside the knowledge graph.

Therefore, a different POS leads to a different meaning, even though we have the same morphology of the word.

Please note that the words with -1 as an ID, such as ADP (adpositions: prepositions and postpositions), PUNCT (punctuation), DET (determiners) and so on, are not present in the knowledge graph because they do not inherently carry semantics.

### d/ Lemmatization

Here is another core task in Natural Language Processing: lemmatization. It is an important step, along with tokenization and POS tagging, in information extraction and text normalization. Particularly useful for opinion mining and emotion detection, lemmas allow the major semantic trends of a document to emerge.

Lemmatization is a linguistic process that groups certain tokens together. In a nutshell, it associates each token with the canonical form that represents it in a dictionary:

  • the infinitive for VERBS: wore, worn -> wear / ran, running, runs -> run
  • the singular form for NOUNS: mice -> mouse / dice -> die
  • etc.

A concept (or syncon) can contain many lemmas (lexemes). During the disambiguation process, each token identified in the text is returned to its base form, removing inflectional affixes. Each lemma is associated with a concept in the knowledge graph. Lemmatization therefore reduces the set of distinct tokens to the smaller set of distinct lemmas. This can be explained through an example.

A human hearing the lexeme "living" can discern, almost unconsciously, what the word means, by making inferences based on knowledge of the world. This is close to impossible for machines when the context is absent.

For a machine to handle the several meanings that a word with the same spelling and the same sound can carry, lemmatization is a key tool for resolving this lexical ambiguity.

We can perform this task with the expert.ai NL API. Let’s consider these two examples:

"She’s living her best life."
"What do you do for a living?"
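The same loop again, this time printing the lemma next to each token (token.lemma assumed per the outputs below):

```python
for text in ("She's living her best life.", "What do you do for a living?"):
    output = client.specific_resource_analysis(
        body={"document": {"text": text}},
        params={'language': 'en', 'resource': 'disambiguation'})

    print(f'{"TOKEN":12}{"LEMMA":12}{"POS":6}')
    for token in output.tokens:
        print(f'{text[token.start:token.end]:12}{token.lemma:12}{token.pos:6}')
```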


Output of the first sentence:

TOKEN       LEMMA       POS
She         she         PRON
's          's          AUX
living      live        VERB
her         her         PRON
best        good        ADJ
life        life        NOUN

Output of the second sentence:

TOKEN       LEMMA       POS
What        what        PRON
do          do          AUX
you         you         PRON
do          do          VERB
for         for         ADP
a           a           DET
living      living      NOUN
?           ?           PUNCT



As stated above, **living** maps to two different lemmas, depending on the context and its position within the sentence. In the first example, **living** corresponds to the lemma **"live"**, the VERB of the sentence. By contrast, **living** in the second sentence is a NOUN and has the lemma **"living"**. The meaning differs as well: the first lemma describes the concept of _"remaining alive"_, while living as a noun belongs to the concept of _"an income or the means of earning it"_.

Consequently, lemmatization helps the machine deduce the meaning of a homographic word.

---

## Conclusion

One expression or word can have more than one meaning, which poses a comprehension problem for machines. Thanks to very basic NLP tasks like lemmatization and PoS tagging, and a few lines of code, we can resolve this ambiguity, and that is what I aimed to share in this article.

Hoping that resolving ambiguity is less ambiguous now...
