Natural Language Processing Notes
Off the back of my last two posts, I thought it necessary we begin a new path. Together, we will walk through fundamental concepts in Natural Language Processing to serve as a kick start for newcomers and a reminder for the long-time practitioners who decide to read along – starting with Sentiment Analysis.
Note: The posts in this series are created from notes I have taken from the Natural Language Processing Specialization on Coursera, with extra things I’ve added because I thought they would be useful.
In the examples given in our notes, we intend to fit a Logistic Regression model to our features. I will not be going much into the inner workings of Logistic Regression; however, if you are extremely interested, you may read "Algorithms from Scratch: Logistic Regression".
Sentiment Analysis
The goal of sentiment analysis is to interpret and classify subjective data using natural language processing and Machine Learning.
Sentiment analysis has become a very important capability in business as the world has become much more digitalized, even more so since Covid-19. Many businesses employ sentiment analysis to monitor social data, get a better understanding of their brand’s reputation, and understand their customers in the digital world.
For instance, a business (or whoever) may decide to use sentiment analysis to automatically determine the polarity of a tweet made about their company (or whatever) in order to gain a better understanding of their brand’s reputation. This task may be defined as a supervised learning problem where we feed input features to a predictive model and get an output.

In order for us to perform sentiment analysis, we must first represent our text as features (what we represented as X in Figure 1), since computers do not understand raw text, before we can use it to classify text.
Well, how do we extract features? Great question. There are many ways. However, before we extract our features and build a Logistic Regression model to classify the sentiment of our data, we must discuss text preprocessing.
Text Preprocessing
Text on the internet is often defined as unstructured data – Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner (Source: Wikipedia). Therefore, text preprocessing is simply the task of organising our text into a predefined manner or into a predefined data model.
There are various techniques we can employ to preprocess our text; however, for this post we will focus mainly on a few:
Lowercasing
Probably the simplest form of text preprocessing, where we ensure all of our text is written in lowercase. This technique is applicable to many text mining and Natural Language Processing tasks and is extremely useful when our dataset is small.
It’s important to note that though lowercasing is often a standard practice, there will be occasions where preserving the capitalization is important.
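A minimal sketch of lowercasing in Python – the example tweet below is made up for illustration:

```python
tweet = "I LOVE Natural Language Processing!"  # hypothetical example tweet
lowercased = tweet.lower()
print(lowercased)  # "i love natural language processing!"
```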
Stemming
When we "stem" an inflection –In linguistic morphology, inflection is a process of word formation, in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness (Source: Wikipedia). For example, who becomes whose— word we are reducing it to it’s root form.
There are many different algorithms for stemming; however, the most common and empirically effective stemmer for the English language is Porter’s algorithm.
Stemming is often useful for dealing with sparsity and/or standardizing vocabulary.
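As a minimal sketch (assuming NLTK is installed), Porter stemming can be applied like this:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Reduce inflected words to their stems
for word in ["running", "runs", "studies"]:
    print(word, "->", stemmer.stem(word))
# e.g. "running" -> "run", "studies" -> "studi"
```

Notice that a stem ("studi") need not be a real dictionary word – the stemmer only chops off endings.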
Lemmatization
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma (Source: Stanford NLP Group).
Simply put, lemmatization aims to remove inflections and map a word to its root form the proper way.
Note: Do not make the mistake of using stemming and lemmatization interchangeably – lemmatization does morphological analysis of the words.
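A short sketch using NLTK’s WordNet lemmatizer (it needs the WordNet data downloaded the first time it is run):

```python
# Requires: nltk.download("wordnet") once before first use
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("mice"))              # "mouse"
print(lemmatizer.lemmatize("running", pos="v"))  # "run" (a part-of-speech hint helps)
```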
Stopwords
Stopwords are the common words in a language. These words are often regarded as words that do not add any meaning to the text (they aren’t important), which is why we remove them.
Removing stopwords isn’t always an effective strategy. There are some cases where removing stopwords tends to be useful, such as topic extraction, but in various classification tasks we may derive useful insights by allowing stopwords to remain.
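A minimal sketch of stopword removal with NLTK – the example tweet is made up for illustration:

```python
# Requires: nltk.download("stopwords") once before first use
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

tweet = "i am going to the park with my dog"  # hypothetical example
filtered = [word for word in tweet.split() if word not in stop_words]
print(filtered)  # ['going', 'park', 'dog']
```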
Normalization
In environments where text may have lots of noise, for example Twitter and text messaging, normalizing text tends to be an important although overlooked step. By noisy environments, I mean places where being informal is common. When we normalize text, we are transforming it into a standard form (i.e. nvm becomes never mind).

Text normalization, like stemming and lemmatization, has no standard method. It is purely dependent on the task at hand, because we wouldn’t normalize our text messages in the same way we’d normalize our notes from a lecture (considering we take notes in non-standard ways).
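One simple way to normalize informal text is a lookup table. This is just a sketch; the mappings below are hypothetical and would be built per task:

```python
# Hypothetical lookup table of informal terms and their standard forms
norm_map = {"nvm": "never mind", "u": "you", "gr8": "great"}

def normalize(text):
    """Replace informal tokens with their standard form where we know one."""
    return " ".join(norm_map.get(word, word) for word in text.split())

print(normalize("nvm u are gr8"))  # "never mind you are great"
```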
Noise Removal
Noise can severely interfere with our text analysis. For example, tweets often contain all manner of special characters that could harm our results as we carry out further preprocessing steps.

There are various forms of noise to remove from our text;
- Special Characters
- Numbers
- HTML
- Domain Specific Keywords (e.g. RT meaning Retweet on Twitter)
- Others (There are many more)
Which ones we remove is domain specific and depends on what counts as "noise" for the task we have at hand.
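A rough sketch of noise removal with regular expressions – the patterns are illustrative, not exhaustive, and would be tuned to the domain:

```python
import re

def remove_noise(tweet):
    """Strip common sources of noise from a tweet (illustrative patterns only)."""
    tweet = re.sub(r"<[^>]+>", " ", tweet)       # HTML tags
    tweet = re.sub(r"https?://\S+", " ", tweet)  # URLs
    tweet = re.sub(r"\bRT\b", " ", tweet)        # domain-specific keyword (Retweet)
    tweet = re.sub(r"\d+", " ", tweet)           # numbers
    tweet = re.sub(r"[^a-zA-Z\s]", " ", tweet)   # special characters
    return re.sub(r"\s+", " ", tweet).strip()    # collapse leftover whitespace

print(remove_noise("RT @user: I scored 100%!!! https://example.com"))
# "user I scored"
```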
Note: For More on Text Preprocessing, I highly recommend you read "All You Need To Know About Text Preprocessing for NLP and Machine Learning" by Kavita Ganesan.
Feature Extraction
Before we can pass our text to a Logistic Regression model, we must first represent our text as a vector. There are many ways for us to represent our text as a vector, but for our task (sentiment analysis), we will look at two vector representations:
- One Hot Encoding
- Positive & Negative Frequency
One-Hot Encoding
For us to do this, we must create a vocabulary. This can be done by creating a list of the unique words from every single tweet in our data.

To extract the features, we take a tweet and, for each word in our vocabulary, mark it with "1" if the word appears in the tweet and "0" if it does not – see Figure 9.
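Here is a minimal sketch of building a vocabulary and one-hot encoding a tweet; the mini-corpus is made up for illustration:

```python
# Hypothetical mini-corpus of tweets
tweets = ["I hate going to school", "I love going to the park"]

# Vocabulary: the unique words across every tweet in the data
vocab = sorted({word for tweet in tweets for word in tweet.lower().split()})

def one_hot(tweet, vocab):
    """Mark 1 for every vocabulary word that appears in the tweet, 0 otherwise."""
    words = set(tweet.lower().split())
    return [1 if word in words else 0 for word in vocab]

print(one_hot("I hate going to school", vocab))
```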

Since the vector for our tweet will be of length V (the number of unique words in our dataset), and there will be only 5 features with a value of 1 for the particular tweet we have chosen to display ("I hate going to school") but many 0’s (length V – 5), we have what is called a sparse vector representation. This simply means we have a large number of zeros, hence we are taking up unwanted space to store the zeros.
If we train our Logistic Regression model on our sparse representation, our model would have to learn n + 1 (for the bias) parameters, where n is equal to the size of our vocabulary, V. As V becomes larger and larger, we would face two major problems:
- Long time to train the model
- Long inference time
Positive & Negative Frequencies
One technique to overcome the sparse representation problem is to transform the vector into a positive and negative frequency count. More specifically, given a word, we want to track the number of times it appears in the positive class and, given another word, track the number of times that word appears in the negative class. With these counts we can extract features and use them as input features to our Logistic Regression model.
In order to perform the positive and negative frequencies technique, we must first create a frequency dictionary – a frequency dictionary is simply a mapping of the counts of words given the target label. For example, we take our vocabulary, count the number of times each word appears in positive tweets, and do the same for negative tweets.
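A minimal sketch of building such a frequency dictionary, keyed by (word, label) pairs – the tweets and labels below are made up for illustration:

```python
def build_freqs(tweets, labels):
    """Map each (word, label) pair to the number of times it occurs."""
    freqs = {}
    for tweet, label in zip(tweets, labels):
        for word in tweet.lower().split():
            freqs[(word, label)] = freqs.get((word, label), 0) + 1
    return freqs

# Hypothetical labelled data: 1 = positive, 0 = negative
tweets = ["i am happy", "i am sad i am not tall"]
labels = [1, 0]
freqs = build_freqs(tweets, labels)
print(freqs[("happy", 1)], freqs[("am", 0)])  # 1 2
```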

To convert this into a feature, we simply take the sum of the positive frequencies and then we take the sum of the negative frequencies for each word in the tweet – See Figure 11.
![Figure 11: For each tweet the input feature would be [bias, positive word frequency, negative word frequency].](https://towardsdatascience.com/wp-content/uploads/2020/10/1wgZflJwxybV3sQbHMyQyhQ.png)
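A sketch of that feature extraction step, reusing the hypothetical build_freqs output from the snippet above:

```python
def extract_features(tweet, freqs):
    """Return [bias, sum of positive word counts, sum of negative word counts]."""
    words = tweet.lower().split()
    pos = sum(freqs.get((word, 1), 0) for word in words)
    neg = sum(freqs.get((word, 0), 0) for word in words)
    return [1, pos, neg]
```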
So that we have a visual example of what this looks like, we will take the example tweet "I am sad, I am not tall" (therefore Xm = "I am sad, I am not tall"). In Figure 5, we can see the frequencies of words in the whole dataset for the positive and negative class, so all we must do is take our tweet and sum the number of times each word appears – see Figure 12.

Therefore the input features for our Logistic Regression model would be [1 (bias), 4 (PositiveWordCount), 10 (NegativeWordCount)].
Wrap Up
From this post, you now know various preprocessing methods and two ways we can extract features to pass into a Logistic Regression model. A good task to practice what you’ve learnt today would be to try this on real data.
Let’s continue the conversation on LinkedIn: