What is NLP & What Do NLP Scientists Do?

Shahariar Rabby
Towards Data Science
8 min read · Aug 18, 2019

I recently started working as an NLP developer at a company. I am obviously happy and relieved to be gainfully employed again.

But one thing I’ve noticed since I started working is that a good number of people, including my dad, have asked me, “What’s NLP and what exactly is it that you do?”

Normally, I would refer them to my blog but I realized I have never written on this before. I’ve written a few articles about specific data science and machine learning concepts but I never personally defined what the profession and industry mean to me. So let’s rectify that right now.

Natural Language Processing, or NLP, is a field of Artificial Intelligence that gives machines the ability to read, understand and derive meaning from human languages.

What is NLP?

First a disclaimer — I am by no means an expert NLP Scientist. While I do have a fair bit of statistics and quant research experience, I consider myself somewhat new to this field.

Let’s start with what the world believes NLP to be:

Natural language processing is the technology used to help computers understand natural human language. Teaching machines to understand how humans communicate is not an easy task.

In recent years, there have been significant breakthroughs in empowering computers to understand language just as we do.

In fact, a typical interaction between humans and machines using Natural Language Processing could go as follows:

1. A human talks to the machine

2. The machine captures the audio

3. Audio to text conversion takes place

4. Processing of the text’s data

5. Data to audio conversion takes place

6. The machine responds to the human by playing the audio file
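As a rough illustration, the six steps above can be sketched as a chain of functions. This is only a toy sketch: `speech_to_text`, `process_text`, `text_to_speech` and `respond` are hypothetical placeholder names, and the "audio" here is just UTF-8 bytes standing in for a real recording, so no actual speech recognition or synthesis takes place.

```python
def speech_to_text(audio: bytes) -> str:
    # Placeholder for step 3: a real system would run a speech recognizer here.
    return audio.decode("utf-8")


def process_text(text: str) -> str:
    # Placeholder for step 4: a trivial keyword rule stands in for real NLP.
    if "hello" in text.lower():
        return "Hello! How can I help?"
    return "Sorry, I did not understand that."


def text_to_speech(text: str) -> bytes:
    # Placeholder for step 5: a real system would synthesize audio here.
    return text.encode("utf-8")


def respond(audio: bytes) -> bytes:
    # Steps 2 to 6: captured audio in, "audio" response out.
    return text_to_speech(process_text(speech_to_text(audio)))


reply = respond(b"Hello machine")
print(reply.decode("utf-8"))
```

In a real assistant each placeholder would be a separate model or service; the point is only that the text processing step sits between two conversion steps.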

What is NLP used for?

Natural Language Processing is the driving force behind the following common applications:

  • Language translation applications such as Google Translate.
  • Word processors such as Microsoft Word, and tools such as Grammarly, that employ NLP to check the grammatical accuracy of text.
  • Interactive Voice Response (IVR) applications used in call centers to respond to certain users’ requests.
  • Personal assistant applications such as OK Google, Hey Siri, Cortana and Alexa.

Helping The Man Earn More Benjamins

So depending on your view of a capitalistic society, you may or may not be happy to hear that NLP scientists are all about driving growth or optimizing the bottom line (profits).

I mean unless you are a teacher or a firefighter or a social worker, then chances are that your role is all about helping your boss earn more Benjamins too. I will say though, in my opinion, good NLP scientists are on average able to impact the companies they work for more than many other job functions. Let me explain why (and also explain what NLP scientists do).

What does an NLP Scientist actually do?

Syntactic analysis and semantic analysis are the main techniques used to complete Natural Language Processing tasks.

Here is a description of how they can be used.

1. Syntax

Syntax refers to the arrangement of words in a sentence such that they make grammatical sense.

In NLP, syntactic analysis is used to assess how the natural language aligns with the grammatical rules.

Computer algorithms are used to apply grammatical rules to a group of words and derive meaning from them.

Here are some syntax techniques that can be used:

  • Bag of Words: A commonly used model that counts how many times each word occurs in a piece of text. Take these two lines as an example:

Words are flowing out like endless rain into a paper cup,

They slither while they pass, they slip away across the universe

Now let’s count the words:

This approach has several downsides: it ignores semantic meaning and context, stop words (like “the” or “a”) add noise to the analysis, and words are weighted purely by frequency rather than by importance (“universe” weighs less than the word “they”).
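The count for the two lines above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a simple regular-expression tokenizer and lowercasing; `bag_of_words` is an illustrative helper name, not part of any library.

```python
import re
from collections import Counter

LYRICS = (
    "Words are flowing out like endless rain into a paper cup, "
    "They slither while they pass, they slip away across the universe"
)


def bag_of_words(text):
    """Lowercase the text, split it into words, and count each word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = bag_of_words(LYRICS)
print(counts["they"], counts["universe"])  # prints: 3 1
```

Word order is thrown away entirely, and “they” ends up with three times the weight of “universe”, which is exactly the weighting problem described above.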

  • Lemmatization: It entails reducing the various inflected forms of a word into a single form for easy analysis.

Lemmatization resolves words to their dictionary form (known as the lemma). To do this, it requires detailed dictionaries that the algorithm can search to link words to their corresponding lemmas.

For example, the words “running”, “runs” and “ran” are all forms of the word “run”, so “run” is the lemma of all the previous words.
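The dictionary lookup behind lemmatization can be sketched in a few lines. This is a toy sketch only: `LEMMA_DICTIONARY` is a hypothetical three-entry dictionary built from the example above, whereas real lemmatizers rely on large lexical resources and usually on the word’s part of speech as well.

```python
# Hypothetical, hand-made dictionary mapping inflected forms to their lemma.
LEMMA_DICTIONARY = {
    "running": "run",
    "runs": "run",
    "ran": "run",
}


def lemmatize(word):
    """Return the lemma if the word is in the dictionary, else the lowercased word."""
    key = word.lower()
    return LEMMA_DICTIONARY.get(key, key)


print([lemmatize(w) for w in ["running", "runs", "ran"]])  # prints: ['run', 'run', 'run']
```

Note that “ran” maps to “run” even though the two share no suffix pattern, which is why lemmatization needs a dictionary rather than simple string rules.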

  • Morphological segmentation: It involves dividing words into individual units called morphemes.
  • Word segmentation: It involves dividing a large piece of continuous text into distinct units.
  • Part-of-speech tagging: It involves identifying the part of speech for every word.
  • Parsing: It involves undertaking a grammatical analysis for the provided sentence.
  • Sentence breaking: It involves placing sentence boundaries on a large piece of text.
  • Stemming: It involves cutting the inflected words to their root form.

Affixes that are attached at the beginning of the word are called prefixes (e.g. “astro” in the word “astrobiology”) and the ones attached at the end of the word are called suffixes (e.g. “ful” in the word “helpful”).

The problem is that affixes can create or expand new forms of the same word (called inflectional affixes), or even create new words themselves (called derivational affixes). In English, prefixes are always derivational (the affix creates a new word as in the example of the prefix “eco” in the word “ecosystem”), but suffixes can be derivational (the affix creates a new word as in the example of the suffix “ist” in the word “guitarist”) or inflectional (the affix creates a new form of word as in the example of the suffix “er” in the word “faster”).

Ok, so how can we tell the difference and chop the right bit?
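One crude answer is to strip a fixed list of suffixes. The sketch below is a deliberately naive stemmer, not a real algorithm such as Porter’s: the `SUFFIXES` list, the `naive_stem` name and the minimum stem length of 3 are all assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order; real stemmers use many more rules.
SUFFIXES = ("ing", "ful", "ist", "er", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["helpful", "guitarist", "faster", "running", "paper"]:
    print(w, "->", naive_stem(w))
```

It gets “helpful”, “guitarist” and “faster” right, but it turns “running” into “runn” and “paper” into “pap”. Blind chopping cannot tell a real suffix from letters that merely look like one, which is why practical stemmers add many extra rules and why lemmatization uses a dictionary instead.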

  • Solving complex equations: Math is the backbone of every NLP task. After finishing all of the data processing, scientists sit together, solve the complex equations and tune the parameters for the specific task.

Can you imagine how many parameters NLP scientists have to determine when designing a language model?

2. Semantics

Semantics refers to the meaning that is conveyed by a text. Semantic analysis is one of the difficult aspects of Natural Language Processing that has not been fully resolved yet.

It involves applying computer algorithms to understand the meaning and interpretation of words and how sentences are structured.

Here are some techniques in semantic analysis:

  • Named entity recognition (NER): It involves determining the parts of a text that can be identified and categorized into preset groups. Examples of such groups include names of people and names of places.
  • Word sense disambiguation: It involves giving meaning to a word based on the context.
  • Natural language generation: It involves using databases to derive semantic intentions and convert them into human language.
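To make the first technique concrete, here is a minimal sketch of named entity recognition using a lookup table (often called a gazetteer). The `GAZETTEER` entries, the `PERSON`/`PLACE` labels and the `tag_entities` helper are hypothetical examples; modern NER systems are typically statistical or neural models rather than plain lookups.

```python
import re

# Hypothetical gazetteer: known names mapped to preset groups.
GAZETTEER = {
    "alice": "PERSON",
    "bob": "PERSON",
    "london": "PLACE",
    "paris": "PLACE",
}


def tag_entities(text):
    """Return (token, label) pairs for every token found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [(tok, GAZETTEER[tok.lower()]) for tok in tokens if tok.lower() in GAZETTEER]


print(tag_entities("Alice flew from London to Paris."))
# prints: [('Alice', 'PERSON'), ('London', 'PLACE'), ('Paris', 'PLACE')]
```

A lookup like this misses any name that is not in the table and cannot tell “Paris” the city from “Paris” the person, which is where context, and the harder parts of semantic analysis, come in.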

Why is NLP difficult?

Natural Language Processing is considered a difficult problem in computer science. It’s the nature of human language that makes NLP difficult.

The rules that dictate the passing of information using natural languages are not easy for computers to understand.

Some of these rules can be high-level and abstract; for example, when someone uses a sarcastic remark to pass information.

On the other hand, some of these rules can be low-level; for example, using the character “s” to signify the plurality of items.

Comprehensively understanding the human language requires understanding both the words and how the concepts are connected to deliver the intended message.

While humans can easily master a language, the ambiguity and imprecision of natural languages are what make NLP difficult for machines to implement.

Working with Natural Language Processing also requires a lot of processing power to solve all of those complex equations. At the time of writing, GPT-2 8B is the largest Transformer-based language model ever trained, at 24x the size of BERT and 5.6x the size of GPT-2.

Guess how long it would take to train a model like that on your 940MX GPU? Approximately 5 years. And what about training a model this big on a MacBook? Probably more than 100 years.

How does Natural Language Processing work?

NLP entails applying algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand.

When the text has been provided, the computer will utilize algorithms to extract meaning associated with every sentence and collect the essential data from them.
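As a small illustration of turning unstructured text into a form a computer can work with, the sketch below breaks a text into sentences, splits each sentence into words, and replaces every word with an integer ID. The regular expressions and the helper names (`split_sentences`, `tokenize`, `build_vocabulary`) are assumptions chosen for simplicity; production systems use far more careful tokenizers.

```python
import re


def split_sentences(text):
    # Naive sentence breaking: split on whitespace that follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text.strip())


def tokenize(sentence):
    # Naive word segmentation: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", sentence.lower())


def build_vocabulary(tokenized_sentences):
    # Assign each distinct word an integer ID, in alphabetical order.
    words = sorted({tok for sent in tokenized_sentences for tok in sent})
    return {word: idx for idx, word in enumerate(words)}


text = "NLP is hard. Computers need numbers!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
vocab = build_vocabulary(tokens)
encoded = [[vocab[tok] for tok in sent] for sent in tokens]

print(sentences)  # prints: ['NLP is hard.', 'Computers need numbers!']
print(encoded)    # prints: [[4, 2, 1], [0, 3, 5]]
```

Everything a model does afterwards, from counting words to training a language model, operates on numeric representations like these rather than on the raw characters.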

Sometimes, the computer may fail to understand the meaning of a sentence well, leading to obscure results.

For example, a humorous incident reportedly occurred in the 1950s during the machine translation of some sentences between English and Russian.

Here is the biblical sentence that required translation:

The spirit is willing, but the flesh is weak.

Here is the result when the sentence was translated to Russian and back to English:

“The Rooh-Afza is good, but the meat is rotten.”

What does the future look like?

At the moment NLP is battling to detect nuances in language meaning, whether due to lack of context, spelling errors or dialectal differences.

In March 2016, Microsoft launched Tay, an Artificial Intelligence (AI) chatbot released on Twitter as an NLP experiment. The idea was that the more users conversed with Tay, the smarter it would get. The result was that after 16 hours Tay had to be removed due to its racist and abusive comments.

Microsoft learned from the experience and some months later released Zo, its second-generation English-language chatbot, designed not to be caught making the same mistakes as its predecessor. Zo uses a combination of innovative approaches to recognize and generate conversation, and other companies are experimenting with bots that can remember details specific to an individual conversation.

Although the future looks extremely challenging and full of threats for NLP, the discipline is developing at a very fast pace (probably like never before) and we are likely to reach a level of advancement in the coming years that will make complex applications look possible.

This post is inspired by some awesome posts like:
