
A Detailed, Novice Introduction to Natural Language Processing (NLP)

The Ultimate Code-Based Guide to Getting Started with NLP in Python


Photo by Joshua Hoehne on Unsplash

Language is the most important medium of communication we use. It transcends both geographical and intellectual boundaries. Research suggests that processing natural spoken or written language is useful across a variety of businesses: search autocompletion and autocorrection, translation, social media monitoring, targeted advertising, survey analysis, and, more generally, improving the overall understanding of foreign languages. In the story below we will implement two distinct NLP algorithms. In the first, we will define a set of grammar rules and parse a sentence based on those rules; in the second, we will web scrape a site and find what it is about.

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) concerned with analyzing and building intelligent systems that can function in languages that humans speak, for example, English. Language processing is needed whenever a system must act on input that a user provides as text or speech in everyday English.

  • Natural Language Understanding (NLU): The understanding phase of the processing is responsible for mapping the input given in natural language to a useful representation. It also analyzes different aspects of the input language that is given to the program.
  • Natural Language Generation (NLG): The generation phase produces natural language from that representation. Generation starts with Text Planning, the extraction of relevant content from the knowledge base. Next is Sentence Planning, where the words that will form the sentence are chosen. It ends with Text Realization, the final assembly of the sentence structure.

Challenges in Natural Language Processing

  1. Lexical Ambiguity: This is the first level of ambiguity and occurs in individual words. For instance, when a program is given a word like ‘board’, it does not know whether to treat it as a noun or a verb.
  2. Syntax Level Ambiguity: This type of ambiguity concerns how a sentence can be parsed in more than one way, as opposed to how a human would naturally read it. For instance, a sentence like ‘He raised the scuttle with a blue cap’ could mean one of two things: either he raised a scuttle with the help of a blue cap, or he raised a scuttle that had a blue cap.
  3. Referential Ambiguity: References made using pronouns constitute referential ambiguity. For instance, two girls are running on the track. Suddenly, one says, ‘I am exhausted’. It is not possible for the program to interpret which of the two girls is tired.
Natural Language Processing | Image by Author

Building a Natural Language Processor

There are a total of 5 execution steps when building a Natural Language Processor:

  1. Lexical Analysis: Processing of natural language by the NLP algorithm starts with identifying and analyzing the structure of the input words. This part is called Lexical Analysis, and a lexicon is the collection of words and phrases used in a language. Lexical analysis divides a large chunk of text into paragraphs, sentences, and words.
  2. Syntactic Analysis / Parsing: Once the sentence structure is formed, syntactic analysis checks the grammar of the formed sentences and phrases. It also forms relationships among words and eliminates logically incorrect sentences. For instance, the English Language analyzer rejects the sentence, ‘An umbrella opens a man’.
  3. Semantic Analysis: In the semantic analysis process, the input text is checked for meaning, i.e., every word and phrase is checked against its dictionary meaning and evaluated for meaningfulness in context. For example, a phrase like ‘hot ice’ is rejected.
  4. Discourse Integration: The discourse integration step forms the story of the sentence. Every sentence should have a relationship with its preceding and succeeding sentences. These relationships are checked by Discourse Integration.
  5. Pragmatic Analysis: Once all grammatical and syntactic checks are complete, the sentences are now checked for their relevance in the real world. During Pragmatic Analysis, every sentence is revisited and evaluated once again, this time checking them for their applicability in the real world using general knowledge.

Tokenization, Stemming, and Lemmatization

Tokenization

To read and understand a sequence of words within a sentence, tokenization breaks the sequence into smaller units called tokens. These tokens can be words, numerals, or at times punctuation marks. Tokenization is also termed word segmentation. Here is an example of how tokenization works:

Input: Cricket, Baseball and Hockey are primarily hand-based sports.
Tokenized Output: "Cricket", "Baseball", "and", "Hockey", "are", "primarily", "hand", "based", "sports"

The points where words end and begin are called word boundaries, and tokenization is essentially the process of identifying the word boundaries in the given sentences.

  • Sent_tokenize Package: This performs sentence tokenization and converts the input into sentences. It can be imported in the Jupyter Notebook with from nltk.tokenize import sent_tokenize
  • Word_tokenize Package: Similar to sentence tokenization, this package divides the input text into words. It can be imported with from nltk.tokenize import word_tokenize
  • WordPunctTokenizer Package: In addition to word tokenization, this package also treats punctuation marks as tokens. It can be imported with from nltk.tokenize import WordPunctTokenizer. A short sketch of all three tokenizers follows below.
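Below is a minimal sketch of the three tokenizers in action; the sample sentence is illustrative, and the nltk.download call fetches the tokenizer models on first use.

### TOKENIZATION EXAMPLE ###
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer
# Download the tokenizer models (needed only once)
nltk.download('punkt')
text = "Cricket, Baseball and Hockey are primarily hand-based sports. Chess is not."
# Split the input into sentences
print(sent_tokenize(text))
# Split the input into words and punctuation
print(word_tokenize(text))
# Split punctuation marks into separate tokens
print(WordPunctTokenizer().tokenize(text))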

Stemming

When studying the natural languages that humans use in conversations, variations occur for grammatical reasons. For instance, words like virtual, virtuality, and virtualization share essentially the same root meaning but play different roles in different sentences. For NLP algorithms to work correctly, they must understand these variations. Stemming is a heuristic process that chops a word down to its root form, which helps in analyzing its meaning.

  • PorterStemmer package: This NLTK package uses Porter’s algorithm to compute stems. For example, an input word of ‘running’ produces the stemmed word ‘run’ after passing through this algorithm. It can be imported into the working environment with from nltk.stem.porter import PorterStemmer
  • LancasterStemmer package: The Lancaster stemmer works like Porter’s algorithm but is more aggressive, stripping words further back toward their source. For instance, the word ‘writing’ after running through the Lancaster algorithm returns ‘writ’. It can be imported with from nltk.stem.lancaster import LancasterStemmer
  • SnowballStemmer package: This works the same way as the other two and can be imported using from nltk.stem.snowball import SnowballStemmer. These algorithms have interchangeable use cases although they vary in aggressiveness; a comparison sketch follows below.
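Here is a quick sketch comparing the three stemmers on the same words; note that SnowballStemmer takes a language argument, and the word list is illustrative.

### STEMMING EXAMPLE ###
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
# Snowball stemmers are language-specific
snowball = SnowballStemmer('english')
for word in ['running', 'writing', 'virtualization']:
    # Lancaster is the most aggressive of the three
    print(word, '->', porter.stem(word), lancaster.stem(word), snowball.stem(word))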

Lemmatization

Lemmatization extracts a word’s base form by taking its morphological information into account. Unlike stemming, it uses both vocabulary and morphological analysis, and aims to remove only inflectional endings. The attained base form is called a lemma.

  • WordNetLemmatizer package: This wordnet-based lemmatizer extracts a word’s base form depending on whether the word is being used as, for example, a noun or a verb. The package can be imported with from nltk.stem import WordNetLemmatizer. A short sketch follows below.
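In the sketch below, note that the lemmatizer treats words as nouns by default, so verbs must be flagged with pos='v'; the example words are illustrative.

### LEMMATIZATION EXAMPLE ###
import nltk
from nltk.stem import WordNetLemmatizer
# Download the WordNet lexical database (needed only once)
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('mice'))              # 'mouse' -- treated as a noun by default
print(lemmatizer.lemmatize('running', pos='v'))  # 'run' -- treated as a verb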

Concepts of Data Chunking

Chunking, as the name suggests, is the process of dividing data into chunks. It is important in the Natural Language Processing realm. The primary function of chunking is to classify parts of speech and short word phrases such as noun phrases. Once tokenization is complete and the input is divided into tokens, chunking labels them so the algorithm can better understand them. Two methodologies are used for chunking, described below:

  • Chunking Up: Going up, or chunking upwards, means zooming out on the problem. In the process of chunking up, the sentences become abstract, and individual words and phrases of the input are generalized. For instance, a question like ‘What is the purpose of a bus?’ after chunking up will be answered with ‘Transport’.
  • Chunking Down: The opposite of chunking up; during downward chunking we move deeper into the language and objects become more specific. For instance, an example like ‘What is a car?’ will yield specific details like the color, shape, brand, and size of the car after being chunked down.

Defining Grammar Rules and Implementing Chunking: In the upcoming section we will apply the rules of English grammar to a chunk of words. This process requires the output to be shown as a pop-up on the screen’s display. If you are running the code locally in Jupyter Notebook, no additional steps are required. But for running the code in Colab, we need to mount a virtual display on the Colab local environment. Steps to do this are shown below.

Building Virtual Display Engines on Google Colab

To show trees as output, our code needs to open a new display window. Tk, or tkinter, would normally create a GUI (like a new window) for your interface. But Colab runs on a web server in the cloud, so it cannot open a new window on the local machine where it is running. The only interaction Colab offers is through the web notebook interface. To display the NLTK trees produced by the chunking code, run the code below in your Colab environment.

### CREATE VIRTUAL DISPLAY ###
# Install X Virtual Frame Buffer
!apt-get install -y xvfb 
import os
# create virtual display with size 1600x1200 and 16 bit color. Color can be changed to 24 or 8
os.system('Xvfb :1 -screen 0 1600x1200x16  &')    
# tell X clients to use our virtual DISPLAY :1.0.
os.environ['DISPLAY']=':1.0'    
### INSTALL GHOSTSCRIPT (Required to display NLTK trees) ###
!apt install ghostscript python3-tk

We will be performing Noun-Phrase chunking in this example, which is a category of chunking. Here we predefine the grammatical notions that the program will use to perform the chunking.

NP-chunks are defined so as not to contain other NP-chunks, and they rely for most of their information on part-of-speech tagging.

Noun-Phrase Chunking: In the code below, we will perform Noun-Phrase (NP) chunking where we search for chunks corresponding to individual noun phrases. To create an NP-chunker, we will define a chunk grammar rule (shown in the code below). The flow of the algorithm will be as follows:

a. The rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).
b. We use this grammar on a sample sentence to build the chunk parser, and display the output graphically as a tree.
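The sketch below follows that flow; the sample sentence is illustrative, and the tree.draw() call is what needs the virtual display on Colab.

### NOUN-PHRASE CHUNKING EXAMPLE ###
import nltk
# Download the tokenizer and part-of-speech tagger models (needed only once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# NP chunk = optional determiner (DT) + any number of adjectives (JJ) + a noun (NN)
grammar = "NP: {<DT>?<JJ>*<NN>}"
sentence = "The little yellow dog barked at the cat"
# Tokenize the sentence and tag each token with its part of speech
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tagged)
print(tree)   # textual bracketing of the chunks
tree.draw()   # opens the tree in a display window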

NLTK Tree | Output of the Code above | Image by Author

Topic Modelling and Identifying Patterns in Data

Documents and discussions generally revolve around topics. Every conversation has a topic at its base, and the discussion revolves around it. For NLP to understand and work on human conversations, it needs to derive the topic of discussion from the given input. To compute this, algorithms run pattern-matching techniques on the input to determine the topic. This process is called topic modeling. It is used to uncover the hidden topics/core of documents that need processing. Topic modeling is useful in the following scenarios:

  • Text Classification: It can improve the classification of textual data, since modeling groups similar words, nouns, and actions together rather than using individual words as singular features.
  • Recommender Systems: Recommender systems rely on finding similar content. Topic modeling algorithms can power recommender systems by computing similarity matrices from the given data. A rough sketch of this style of categorization follows below.
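The sketch scrapes a page, drops the stop words, and counts the most frequent content words as a proxy for the topic. The URL below is a stand-in, not necessarily the page used in the original notebook.

### TOPIC CATEGORIZATION SKETCH ###
import requests
from collections import Counter
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
# Stand-in URL -- replace with the page you want to categorize
url = 'https://en.wikipedia.org/wiki/Tesla,_Inc.'
# Fetch the page and strip the HTML markup
text = BeautifulSoup(requests.get(url).text, 'html.parser').get_text()
# Keep lowercase alphabetic tokens that are not stop words
stop_words = set(stopwords.words('english'))
words = [w.lower() for w in word_tokenize(text)
         if w.isalpha() and w.lower() not in stop_words]
# The most frequent content words approximate the page's topic
print(Counter(words).most_common(10))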

The output of the categorization code above. We observe that the NLP algorithm rightly identifies Tesla to be the most talked-about topic of the weblink. | Image by Author

Complete Code Repository

For both the code notebooks executed above, the complete code with all input and output is present in the repository linked below. I would strongly advise anyone following the article to walk through the code side-by-side for the best understanding of the concepts discussed above.

ai-with-python-series/Natural Language Processing

Conclusion

Photo by David Ballew on Unsplash

The study of languages and their connection to human cognition is fascinating. In the article and the code above, we performed two basic language processing tasks – noun-phrase chunking and categorization of text. These are basic implementations of two of the most commonly used NLP techniques. Machines are inherently designed to work with numbers, while language deals with words, phrases, grammar, and many complex constructs that cannot be taught to a machine directly. NLP is therefore an extremely vast area of study. The aim of this tutorial was to provide a starting point for entering this multiverse. I would suggest you follow the links in the references section below to get a better grasp of Natural Language Processing.


Interesting Machine Learning Reads

Implementing an End-to-End Machine Learning Workflow with Azure Data Factory

Logic Programming and the Design of Humanistic AI using Python

The Ultimate Guide to Functional Programming for Big Data

About Me

I am a Data Engineer and an Artificial Intelligence Researcher currently working in Microsoft Xbox game analytics where I implement similar pipelines in my daily work, to analyze game acquisitions, usage, and health. Apart from my professional work, I am also researching ways of implementing AI to balance the economics of regions across the world that have been impacted by gradual climatic changes over the years. Please feel free to connect with me on Twitter or LinkedIn for any discussions, questions, or projects you would like to collaborate on.


References

  1. https://searchenterpriseai.techtarget.com/definition/natural-language-processing-NLP
  2. https://www.analyticsvidhya.com/blog/2020/07/top-10-applications-of-natural-language-processing-nlp/
  3. https://www.wonderflow.ai/blog/natural-language-processing-examples
  4. https://www.quora.com/What-is-pragmatic-analysis-in-NLP
  5. https://aclanthology.org/W17-5405.pdf
  6. https://www.oak-tree.tech/blog/data-science-nlp
  7. https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b
  8. https://www.geeksforgeeks.org/nlp-chunking-and-chinking-with-regex/
  9. https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb
  10. https://www.analyticssteps.com/blogs/what-are-recommendation-systems-machine-learning
