
Text Normalization

Why, what and how.

Image by Markus Winkler from Unsplash, edited by the author.

In the last few articles, we spent some time explaining and implementing some of the most important preprocessing techniques in NLP. However, we have played very little with real text situations. Now it is time to work with that.

We talked about Text Normalization in the article about stemming. However, stemming is not the most important (or even the most used) task in Text Normalization. We also covered some other Normalization techniques earlier, such as Tokenization, Sentencizing and Lemmatization. But there are other smaller tasks that make up this important preprocessing step, and they are the subject of this article.

Keep in mind that there is no "correct" list of Normalization tasks that works for all situations. In fact, as we dive deeper into NLP, we increasingly realize that NLP is not as general as one may think. While there are many interesting general purpose toolboxes and premade pipelines, the most precise systems are the ones tailored to their context.

Therefore, take the list of normalization steps presented in this article not as hard rules, but as guidelines for doing text Normalization. It is also important to point out that, in some rare cases, you might not want to normalize the input: these are the cases where variation and errors are important or even vital (consider a test correction algorithm, for example).

Understanding our targets – why we need normalization

Let's start off by clearly defining what we want to achieve with normalization techniques. Natural language, as a human product, tends to inherit the nature of its creator: randomness. This means that, as we "produce" natural language, we imprint our random states onto it. Computers are not so good at dealing with randomness (although this is being mitigated by Machine Learning algorithms).

When we normalize a natural language resource, we attempt to reduce its randomness, bringing it closer to a predefined "standard". This reduces the amount of different information the computer has to deal with, and therefore improves efficiency.

By normalizing, we want to bring "text distributions" closer to the "Normal" distribution. Image taken from Wikipedia.

When we normalize a natural language resource, we attempt to reduce the randomness in it

In the article about Stemming, I mentioned that normalization attempts to bring things closer to the ‘normal distribution’. That is true in the sense that, when we normalize a natural language input, we want to make things ‘behave as expected’, in a ‘good’ and ‘predictable’ shape, like probability distributions that follow the Normal Distribution.

Mathematics aside, we can discuss the benefits of feeding normalized inputs to our NLP systems.

First of all, by reducing variation, we have fewer input variables to deal with, improving overall performance and avoiding false negatives (imagine a software log line that would have triggered a warning if it hadn’t contained a typo). This is especially true for expert systems and Information Retrieval tasks (imagine if Google’s search engine only matched the exact words that you typed!).

In some sense, normalization could be compared to the "removal of sharp edges". Image from Architect of the Capitol.

Second, especially for machine learning algorithms, normalization reduces the dimensionality of the input if we’re using plain old structures like Bags of Words or TF-IDF dicts, or lowers the amount of processing needed to create embeddings.

Third, normalization helps deal with code-breaking inputs before they are passed to our decision-making NLP algorithm. In this case, we ensure that our inputs follow a "contract" before being processed.

Finally, if done correctly, normalization is very important for the reliable extraction of statistics from our natural language inputs – as in other areas (such as time series analysis), normalization is an important step in the hands of an NLP Data Scientist/Analyst/Engineer.

What do we want to Normalize?

That is an important question. When doing text normalization, we should know exactly what we want to normalize and why. Also, the purpose of the input helps shape the steps we’re going to apply. There are two things we are most interested in normalizing:

  • Sentence structure: should it always end with punctuation? Can there be repeated punctuation marks? Should we even remove all punctuation? More specific structures can also be enforced (like restricting sentences to subject-verb-object), but that is harder to achieve.
  • Vocabulary: this is one of the core things to pay attention to. Most of the time, we want our vocabulary to be as small as possible. The reason is that, in NLP, words are our key features, and when they show less variation, we can achieve our objectives better.

In practice, we can normalize these two aspects by breaking them into simpler problems. Here’s a list of the most common ones:

→ Removal of duplicate whitespaces and punctuation.

→ Accent removal (if your data includes diacritical marks from ‘foreign’ languages – this helps to reduce errors related to encoding type).

→ Capital letter removal (often, working with lowercase words delivers better results. In some cases, however, capital letters are very important for extracting information, like names and locations).

→ Removal or substitution of special characters/emojis (e.g.: remove hashtags).

→ Substitution of contractions (very common in English; e.g.: ‘I’m’→’I am’).

→ Transform word numerals into numbers (e.g.: ‘twenty three’→’23’).

→ Substitution of values for their type (e.g.: ‘$50’→’MONEY’).

→ Acronym normalization (e.g.: ‘US’→’United States’/’U.S.A’) and abbreviation normalization (e.g.: ‘btw’→’by the way’).

→ Normalize date formats, social security numbers or other data that have a standard format.

→ Spell correction (one could say that a word can be misspelled infinite ways, so spell corrections reduce the vocabulary variation by "correcting") – this is very important if you’re dealing with open user inputs, such as tweets, IMs and emails.

→ Removal of gender/tense/degree variation with Stemming or Lemmatization.

→ Substitution of rare words for more common synonyms.

→ Stop word removal (more a dimensionality reduction technique than a normalization technique, but let us leave it here for the sake of mentioning it).
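To make a few of these concrete, here is a minimal sketch (plain Python, nothing beyond the standard library) of three of the simpler steps: lowercasing, accent removal and duplicate whitespace removal. The function name and sample sentence are illustrative only:

import re
import unicodedata

def simple_normalize(text):
    # Lowercase everything (note: this loses capitalization cues such as names).
    text = text.lower()
    # Strip accents by decomposing characters and dropping the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(simple_normalize("Café   com  LEITE!"))  # -> "cafe com leite!"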

For this article, I’ll discuss the implementation of just a few of them.

How to do Normalization

To choose which normalization steps we’re going to use, we need a specific task. For this article, we’ll suppose that we want to extract the sentiment of a set of 3000 tweets for the #COVIDIOTS hashtag, extracted during the end of March 2020, to see how people were behaving regarding the COVID-19 pandemic around the world.

I went ahead and collected these tweets, which can be downloaded here. I also censored curse words in the text using a nifty tool named better-profanity, which you can add to your own normalization pipeline if you want. The tweets also do not contain the names of the people who wrote them.

However, I did not go to the effort of removing names or checking each tweet for political positions, fake claims, etc., since this is not the purpose of this article and could fill another entire article on its own (about automated censoring).

What I want to make clear is that I take no responsibility for the content of the tweets, since their authors consciously made them publicly available the moment they decided to post them on Twitter. I just batch downloaded the tweets. That said, let us continue (below is a link to a nice article that teaches how to mine tweets using Python).

Get and Work With Twitter Data in Python Using Tweepy

For this case in particular, we want to apply the following steps: removal of duplicate whitespace and punctuation; substitution of contractions; and spell correction. Also, since we’ve already discussed lemmatization, we’re using it too.

After we’re through the code part, we’ll analyse the results of applying the mentioned normalization steps statistically.

One important thing about normalization is that the order of the functions matters. We could say that Normalization is a pipeline within the NLP preprocessing pipeline. If we’re not careful, we can remove information that is important for later steps (such as removing stopwords before lemmatizing).

Like in a production line, the order of the Normalization steps matters. Image by Pattama Pon at Pinterest.

We could even divide these steps into two consecutive groups: "pre-tokenization steps" (steps that modify sentence structure) and "post-tokenization steps" (steps that only modify individual tokens), to avoid duplicating tokenization work. However, for the sake of simplicity, we’re using a simple .split() function.

After we’ve parsed our tweets into a list of strings, we can start creating the functions. By the way, I’m wrapping the lists with a nifty module called tqdm so we get nice progress bars when we apply the normalization process. Here are the needed imports:

from symspellpy.symspellpy import SymSpell, Verbosity
import pkg_resources
import re, string, json
import spacy
from tqdm import tqdm
#Or, for jupyter notebooks:
#from tqdm.notebook import tqdm

Removal of duplicate whitespace and duplicate punctuation (and URLs):

  • Done with simple regex replacements. There’s room for improvement, but it does what we expect (this way, we don’t have multiple lengths of ellipses and exclamation point runs). We remove URLs since this greatly reduces the number of distinct tokens we have (and we do it first, since the punctuation replacement might otherwise break them). A sketch follows below.
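Since the original gist is not reproduced here, the following is a rough sketch of what such a function might look like; the exact regexes are my own illustration, not necessarily the ones used in the notebook:

import re

def remove_urls_and_duplicates(text):
    # Remove URLs first, so the punctuation handling below does not mangle them.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Collapse runs of the same punctuation mark ("!!!" -> "!", "...." -> ".").
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)
    # Collapse duplicate whitespace.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(remove_urls_and_duplicates("So   cool!!! See https://example.com ..."))
# -> "So cool! See ."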

Substitution of contractions:

  • Using a list of contractions from Wikipedia, we loop through the sentences and replace the contractions with their expanded forms (this benefits from happening before tokenization, since one token is split into two). This helps with better sentence structuring later. The list can be downloaded here. A sketch follows below.
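A possible sketch of that replacement, assuming the contraction list was saved as a JSON dictionary mapping contractions to expansions (the file name contractions.json and its format are my assumptions):

import json
import re

# Assumed format: {"i'm": "i am", "don't": "do not", ...}
with open("contractions.json") as f:
    contractions = json.load(f)

def expand_contractions(text):
    # Replace each contraction with its expanded form, ignoring case.
    for contraction, expansion in contractions.items():
        text = re.sub(re.escape(contraction), expansion, text, flags=re.IGNORECASE)
    return text

print(expand_contractions("I'm sure they don't know"))
# -> "i am sure they do not know" (case handling kept deliberately simple)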

Spell Correction:

  • Now, this is a tricky one. It can (and will) cause some unwanted changes (most spell-correction dictionaries lack important contextual words and treat them as misspellings), so you have to use it consciously. There are many ways to do it. I chose a module named symspellpy, which is really fast (this matters a lot!) and does the job reasonably well. Another way is to train a deep learning model to do spell correction based on context, but that is another story entirely. A sketch follows below.
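Here is a minimal sketch of how symspellpy can be used for token-level correction, following the library’s documented API; the edit-distance settings are my choice and not necessarily the ones used for the tweets:

import pkg_resources
from symspellpy.symspellpy import SymSpell, Verbosity

# Load the English frequency dictionary that ships with symspellpy.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

def correct_spelling(tokens):
    corrected = []
    for token in tokens:
        # include_unknown=True keeps the original token when nothing better is found.
        suggestions = sym_spell.lookup(token, Verbosity.TOP,
                                       max_edit_distance=2, include_unknown=True)
        corrected.append(suggestions[0].term if suggestions else token)
    return corrected

print(correct_spelling(["pandemc", "is", "serios"]))  # e.g. ['pandemic', 'is', 'serious']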

Lemmatization:

  • If you’ve been following along with my series, you know that I’ve implemented my own lemmatizer. However, for the sake of simplicity, I chose to use good old spaCy here. It is fast and straightforward, but you can use any other tool you want. I also decided to remove (replace) any hashtags and mentions here, since we don’t really need them for sentiment analysis. A sketch follows below.
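A sketch with spaCy, assuming the small English model is installed (python -m spacy download en_core_web_sm); the hashtag/mention regex is my own simple illustration:

import re
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text):
    # Drop hashtags and mentions; they are not needed for sentiment analysis here.
    text = re.sub(r"[#@]\w+", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(token.lemma_ for token in nlp(text))

print(lemmatize("The #covidiots were buying all the rolls"))
# e.g. -> "the be buy all the roll" (exact output depends on the spaCy model version)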

Finally, we join all steps in a "pipeline" function:
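Since the gist itself appears only as an image below, here is a rough sketch of what such a pipeline function could look like, chaining the functions sketched above and wrapping the loop with tqdm for a progress bar:

from tqdm import tqdm

def normalization_pipeline(sentences):
    normalized = []
    for sentence in tqdm(sentences, desc="Normalizing"):
        # Order matters: structural fixes first, then token-level fixes.
        sentence = remove_urls_and_duplicates(sentence)
        sentence = expand_contractions(sentence)
        tokens = correct_spelling(sentence.split())
        normalized.append(lemmatize(" ".join(tokens)))
    return normalized

normalized_tweets = normalization_pipeline(tweets)  # 'tweets' is the parsed list of strings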

Normalization pipeline running in a Google Colab Notebook. Image by author.

Results

So, you might be wondering: what are the results of applying these tasks? I’ve run a few counting functions and plotted some charts to help explain, but I have to be clear about one thing: numbers are not the best way to express the importance of text normalization.

Rather, text normalization shows its value best in downstream NLP applications, by improving efficiency, accuracy and other relevant scores. Still, I’ll point to some benefits that we can clearly see in the statistics.

First, we can clearly see a reduction in the total number of distinct tokens. In this specific case, we reduced the number of tokens by about 32%.

After applying Normalization to our data, we reduced the number of tokens by about 32%. Image by author.
Distinct words in unnormalized text: 15233 – 80% of the text corresponds to 4053 distinct words.
Distinct words in normalized text: 10437 – 80% of the text corresponds to 1251 distinct words.

Now, a bigger difference appears in the number of common tokens, that is, the tokens that make up about 80% of all the text. Usually, about 10–20% of the distinct tokens account for roughly 80% of the text.

By applying normalization, we reduced the number of most common tokens by 69%! That is a lot! It also means that any machine learning technique we plug this data into will be able to generalize better.

After normalization, the number of most common tokens was reduced by 69%. Image by author.

Now, one important thing about text normalization is that, for it to be useful, the normalized text has to retain its natural language structure. We can see this in the data distribution itself. For example, if normalization is done properly, sentences will not become much shorter or longer afterwards.

This is shown in the following histograms, which indicate that, although we have fewer one-token sentences and more two-token sentences after normalization, the rest of the distribution follows the structure of the unnormalized data (also, note that our curve tends to be slightly closer to the Normal distribution curve).

Normalization had little impact on overall sentence structure. Image by author.

Another tool that helps us visualize this is the boxplot. It shows how our data is distributed, including medians, quartiles and outliers. In summary, we want our median line to be the same as (or as close as possible to) that of the unnormalized data. We also want the box (the distribution of most of our data) to remain in a similar place. If we manage to increase the size of the box, it means we have more data clustered around the median than before normalization (which is good). Also, we want to reduce the outliers (the dots outside the range of the whiskers).

After Normalization, we were able to increase the interquartile range (where most tokens are). We also kept the same median line and reduced outliers. This means we did not break our text, but made it less complex =). Image by author.

If you want to learn how I achieved these results, and access and run all the code mentioned above, check the following Colab Notebook:

Google Colaboratory

Conclusion

In this article, I hope to have explained what Text Normalization is, why we should do it and how to do it. I also attempted to present some evidence that it works (without yet presenting its downstream benefits).

If you’ve been following along with the series and are wondering whether I added normalization to the set of tools I’m developing, the answer is yes! I just didn’t use it here, to keep the explanation simpler. In short, I added functionality that allows most of the mentioned normalization steps to be applied directly to our Document or Sentence structures (using the tokenization tools we developed earlier).

You can see where I’m at by looking at the state of the following commit:

Sirsirious/NLPTools

Now that we have many tools, we have to start applying them. But first, how can we turn text into features for Machine Learning algorithms? That is the topic of the next article!

Here are a couple of links and a document for extra research:

Text Normalization

Encoder-Decoder Methods for Text Normalization

