Natural Language Processing Pipeline Decoded!

Ananya Banerjee
Towards Data Science
7 min read · May 15, 2020



Natural language is the language that we write, speak, and understand; almost all known human languages fall under this umbrella. Natural Language Processing (NLP) is the task of processing written language so that a computer can understand it.

Let’s talk about some very basic tasks that are required to make natural language text ready for a Machine Learning or Deep Learning model.

Sentence Segmentation

Converting to Lowercase

Tokenization of Words

Removing Punctuations, Special Characters and Stopwords

Lemmatization / Stemming

Creation of Bag of Words Model / TF-IDF Model

Let’s talk about each of them one by one.

Sentence Segmentation is a well known subtask of Text Segmentation. Text Segmentation is basically dividing the given text into logically decipherable units of information. An example of one such logical unit is a sentence. Thus, the task of dividing the given text into sentences is known as Sentence Segmentation. This task is the first step towards processing text. Dividing a document containing lots of text into sentences helps us process the document sentence by sentence, without losing the essential information each sentence may contain.

Please do note that sentence segmentation depends on the nature of the document and the type of sentence boundaries it adheres to. For example, in one document the text may be divided into sentences at the full stop (“.”), while in another it may be divided at the newline character (“\n”). Thus, it is essential to look at your document and identify a reasonable sentence boundary before attempting sentence segmentation.

Sentence Segmentation Illustration

The above picture shows an example of how spaCy performs sentence segmentation, dividing the given text into two sentences at the full stop.
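If you want to try this yourself, here is a minimal sketch along the same lines, assuming the en_core_web_sm model is installed and using a made-up example text:

```python
import spacy

# Load a small English pipeline (assumes: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "I love natural language processing. It helps computers understand text."
doc = nlp(text)

# spaCy exposes the detected sentence boundaries through doc.sents
for sent in doc.sents:
    print(sent.text)
```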

The next task is usually converting all sentences to lowercase. This is essential in problems where you don’t want to differentiate between words based on their case. For example, “Run” and “run” are the same word and should not be treated as two different words by your model if your task is classification.

Conversion to Lowercase Illustration
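In code, this step is usually a one-liner; a minimal sketch with a made-up example sentence:

```python
sentence = "Run as fast as you can, Run!"

# Python's built-in str.lower() converts every character to lowercase,
# so "Run" and "run" map to the same token later in the pipeline.
lowercased = sentence.lower()
print(lowercased)  # run as fast as you can, run!
```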

One good counterexample, where converting to lowercase may lead to loss of vital information, is the problem of Named Entity Recognition (NER). Identifying named entities becomes a much tougher task if all the words in a sentence are converted to lowercase; even libraries like spaCy fail to identify named entities accurately in that case. Thus, it’s essential to understand the nuances of your problem before lowercasing all the words in your document.

The next task to understand is Word Tokenization. Tokenization is the process of dividing a sentence into words. This is done so that we can understand the syntactic and semantic information contained in each sentence of the corpus. Thus, we decipher the relevance of a sentence by analyzing it word by word, making sure that no information is lost. Tokenization can be performed using different heuristics or word boundaries, such as spaces or tabs. One such example is shown below.

Tokenization of Word Illustration

As you can see, spaCy detects word boundaries and tokenizes the given text.
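A minimal sketch of the same idea with spaCy, again assuming en_core_web_sm is installed and using a made-up sentence:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

sentence = "Natural language processing is fun to learn."
doc = nlp(sentence)

# Each element of the Doc is a Token; token.text gives the raw word string.
tokens = [token.text for token in doc]
print(tokens)
```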

Next, we remove punctuation to ensure that we do not have “,”, “.”, etc. in our list of tokens. This is important in many problems because punctuation usually does not matter when processing natural language with Machine Learning or Deep Learning algorithms, so removing it is the sensible thing to do. You can either remove punctuation by traversing your list of tokens, or remove it from every sentence right from the get-go. The latter is shown below.

Punctuation Removal Illustration
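One common way to do this, sketched below with a made-up sentence (the original illustration may use a different approach):

```python
import string

sentence = "Hello, world! NLP is fun, isn't it?"

# str.translate with a mapping that deletes every character in string.punctuation
no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
print(no_punct)  # Hello world NLP is fun isnt it
```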

The next step is usually removing special characters such as “!@#$%^&*” from the list of tokens received after tokenization. This is done according to need and is highly dependent on the kind of problem you are trying to solve. For example, if you are trying to detect tweets in a given corpus, removing special characters like ‘@’ might not help you, since people commonly use ‘@’ in tweets. Shown below is a code fragment that removes special characters from a sentence (using Python’s regex library, re) if your problem demands it.

Special Character Removal Illustration
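A sketch along those lines; the exact pattern in the original notebook may differ, and the example sentence is made up:

```python
import re

sentence = "Congrats @ananya!!! You won $100 #winner"

# Keep only letters, digits and whitespace; everything else is treated as
# a special character and dropped. Adjust the pattern to your problem;
# for tweets you may want to keep '@' and '#'.
cleaned = re.sub(r"[^A-Za-z0-9\s]", "", sentence)
print(cleaned)  # Congrats ananya You won 100 winner
```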

Another significant step is the removal of stopwords. Stopwords are the most commonly occurring words in any language. For the sake of convenience, let’s assume that English is our primary language; some of the most common stopwords are “in”, “and”, “the”, “a”, and “an”. This is an important step because you don’t want your model to waste time on words that carry little meaning on their own, and stopwords hardly ever do. They can easily be removed from a sentence or a list of tokens without much loss of information, thereby speeding up the training process of your model. Thus, it’s almost always a good idea to remove them before training.

The image below shows how to do this using NLTK.

Stopwords Removal Illustration
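If you would like to reproduce it, a minimal sketch using NLTK’s English stopword list (the example sentence is made up):

```python
import nltk
from nltk.corpus import stopwords

# One-time download of the stopword list (safe to re-run)
nltk.download("stopwords")

sentence = "this is an example of a sentence with a lot of stopwords in it"

stop_words = set(stopwords.words("english"))

# Keep only tokens that are not in NLTK's English stopword list
filtered = [word for word in sentence.split() if word.lower() not in stop_words]
print(filtered)  # ['example', 'sentence', 'lot', 'stopwords']
```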

The next task is usually Lemmatization and/or Stemming. Both processes involve normalizing a word so that only its base form remains, keeping the meaning intact while removing inflectional endings. This is an essential step, since you don’t want your model to treat words like “running” and “run” as separate words.

Lemmatization uses morphological analysis and a vocabulary to identify the base form of a word (its lemma), while Stemming simply chops off word endings such as “ing”, “s”, etc. in the hope of finding the base word. The picture below shows the difference between the two.

Lemmatization v/s Stemming Illustration

As you can see, the lemmatizer in the picture above correctly identifies that the word “corpora” has the base form “corpus”, while the stemmer fails to detect that. However, the stemmer correctly identifies “rock” as the base form of “rocking”. Thus, whether to use lemmatization, stemming, or both is highly dependent on your problem requirements.
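A comparable sketch of the same comparison using NLTK’s WordNetLemmatizer and PorterStemmer (the original illustration may use a different lemmatizer; note that lemmatize() treats words as nouns unless you pass a part-of-speech tag):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the WordNet data used by the lemmatizer (safe to re-run)
nltk.download("wordnet")
nltk.download("omw-1.4")

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

words = ["corpora", "rocking", "running"]

for word in words:
    # lemmatize() defaults to pos="n" (noun); pass pos="v" for verbs like "rocking"
    print(word, "-> lemma:", lemmatizer.lemmatize(word),
          "| stem:", stemmer.stem(word))
```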

Now that we have discussed the basic ideas needed to process text data, let’s talk about how to convert text into a machine-learnable form. One such technique is the Bag of Words model. Bag of Words simply counts the number of occurrences of each word in the text.

Bag of Words Model Illustration

The code shown above first takes a corpus containing 4 sentences. Then it uses sklearn’s CountVectorizer to create a Bag of Words model. In other words, it creates a model that records how many times each unique word in the corpus (obtained via vectorizer.get_feature_names()) occurs in every sentence. For the sake of brevity, I have added a column “Sentence” which helps us see which count corresponds to which sentence. The last line in the code creates a “bow.csv” file containing all the aforementioned counts for every word, as shown below.

Bag of Words Model Illustration
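A sketch along these lines, using a made-up 4-sentence corpus (the notebook’s actual corpus is in the illustration above); note that newer sklearn versions use get_feature_names_out() in place of get_feature_names():

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus of 4 sentences (made up for illustration)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # sparse matrix of word counts

# One column per unique word in the corpus, one row per sentence
df = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())
df.insert(0, "Sentence", corpus)  # helper column to see which row is which sentence

df.to_csv("bow.csv", index=False)
print(df)
```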

Another such model is the Term Frequency Inverse Document Frequency (TF-IDF) model. This model balances Term Frequency (TF) against Inverse Document Frequency (IDF). Term Frequency is the number of times a word occurs in a given document (as shown in the Bag of Words Model Illustration). Inverse Document Frequency is inversely related to the number of documents in the corpus in which the word appears. In other words, where term frequency measures how common a word is within a document, inverse document frequency measures how rare it is across documents.

Now, let's look at how to do this in Python.

TF-IDF Illustration

The code shown above first takes a corpus containing 4 sentences. Then it uses sklearn’s TfidfVectorizer to create a TF-IDF model, i.e., the TF-IDF value of each word in each sentence. The code then creates a file “tfidf.csv” which contains the unique words of the corpus as columns and their corresponding TF-IDF values as rows, to help illustrate the idea behind the model. An example of the kind of file it creates is shown below.

TF-IDF Model Illustration
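A comparable sketch using sklearn’s TfidfVectorizer on the same made-up corpus as before. By default TfidfVectorizer applies a smoothed IDF and L2-normalizes each row, so the values differ slightly from the plain tf × log(N/df) textbook formula:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the dog chased the cat",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix of TF-IDF weights

# One column per unique word, one row per sentence
df = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
df.insert(0, "Sentence", corpus)

df.to_csv("tfidf.csv", index=False)
print(df.round(3))
```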

Now, you can use any of these models as training data for problems such as text classification, sentiment analysis, etc. Now that you understand the basics of handling textual data, you are ready to begin your first project!

The entire code for this article is available here. Please note that every cell in the Jupyter notebook can be run independently (each cell has its own import statements for convenience); you may prefer to import everything once at the top of your code instead of importing multiple times.

I hope this article helped you understand how to process natural language and get started on your journey towards mastering NLP one day!

Thank you for reading!

P.S. If you have any questions or if you want me to write on any particular topic, please comment below.

References:

  1. Speech and Language Processing, 3rd Edition by Dan Jurafsky and James H. Martin
  2. Sklearn: https://scikit-learn.org/stable/
  3. Spacy: https://spacy.io/
  4. NLTK: https://www.nltk.org/


