Introduction to Natural Language Processing for Text

Ventsislav Yordanov
Towards Data Science
16 min read · Nov 17, 2018


After reading this blog post, you’ll know some basic techniques for extracting features from text, so you can use these features as input for machine learning models.

What is NLP (Natural Language Processing)?

NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.

For example, we can use NLP to create systems like speech recognition, document summarization, machine translation, spam detection, named entity recognition, question answering, autocomplete, predictive typing and so on.

Nowadays, most of us have smartphones with speech recognition. These smartphones use NLP to understand what is said. Also, many people use laptops whose operating systems have built-in speech recognition.

Some Examples

Cortana

Source: https://blogs.technet.microsoft.com/microsoft_presse/auf-diesen-4-saeulen-basiert-cortanas-persoenlichkeit/

Microsoft’s operating system has a virtual assistant called Cortana that can recognize natural speech. You can use it to set up reminders, open apps, send emails, play games, track flights and packages, check the weather, and so on.

You can read more about Cortana commands here.

Siri

Source: https://www.analyticsindiamag.com/behind-hello-siri-how-apples-ai-powered-personal-assistant-uses-dnn/

Siri is the virtual assistant for Apple’s iOS, watchOS, macOS, HomePod, and tvOS operating systems. Again, you can do a lot of things with voice commands: start a call, text someone, send an email, set a timer, take a picture, open an app, set an alarm, use navigation, and so on.

Here is a complete list of all Siri commands.

Gmail


Gmail, the famous email service developed by Google, uses spam detection to filter spam emails out of your inbox.

Introduction to the NLTK library for Python

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to many corpora and lexical resources. Also, it contains a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Best of all, NLTK is a free, open source, community-driven project.

We’ll use this toolkit to show some basics of the natural language processing field. For the examples below, I’ll assume that we have imported the NLTK toolkit with import nltk.
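If you haven’t used NLTK before, you’ll also need to download a few resources once. Here is a minimal setup sketch (these are the standard NLTK download names used by the examples below; newer NLTK versions may ask for additional resources):

import nltk

nltk.download("punkt")      # models used by the sentence and word tokenizers
nltk.download("wordnet")    # lexical database used by the WordNet lemmatizer
nltk.download("stopwords")  # the predefined list of stop words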

The Basics of NLP for Text

In this article, we’ll cover the following topics:

  1. Sentence Tokenization
  2. Word Tokenization
  3. Text Lemmatization and Stemming
  4. Stop Words
  5. Regex
  6. Bag-of-Words
  7. TF-IDF

1. Sentence Tokenization

Sentence tokenization (also called sentence segmentation) is the problem of dividing a string of written language into its component sentences. The idea here looks very simple. In English and some other languages, we can split the sentences wherever we see sentence-ending punctuation such as a full stop, a question mark, or an exclamation mark.

However, even in English, this problem is not trivial due to the use of the full stop character in abbreviations. When processing plain text, tables of abbreviations that contain periods can help us prevent incorrect assignment of sentence boundaries. In many cases, we use libraries to do that job for us, so don’t worry too much about the details for now.

Example:

Let’s look at a piece of text about the famous board game backgammon.

Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.

To apply sentence tokenization with NLTK, we can use the nltk.sent_tokenize function.
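A minimal sketch of how that might look with the text above:

import nltk

text = ("Backgammon is one of the oldest known board games. Its history can be "
        "traced back nearly 5,000 years to archeological discoveries in the "
        "Middle East. It is a two player game where each player has fifteen "
        "checkers which move between twenty-four points according to the roll "
        "of two dice.")

sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()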

As an output, we get the 3 component sentences separately.

Backgammon is one of the oldest known board games.

Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.

It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.

2. Word Tokenization

Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words. In English and many other languages using some form of Latin alphabet, space is a good approximation of a word divider.

However, we can still have problems if we only split on spaces. Some English compound nouns are written in more than one way and sometimes contain a space (for example, “ice cream”). In most cases, we use a library to achieve the wanted results, so again don’t worry too much about the details.

Example:

Let’s use the sentences from the previous step and see how we can apply word tokenization to them. We can use the nltk.word_tokenize function.
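A short sketch of the same loop, this time tokenizing each sentence into words:

import nltk

# "text" is the backgammon paragraph from the previous example
for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))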

Output:

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']

['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']

['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']

3. Text Lemmatization and Stemming

For grammatical reasons, documents can contain different forms of a word such as drive, drives, driving. Also, sometimes we have related words with a similar meaning, such as nation, national, nationality.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Source: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Examples:

  • am, are, is => be
  • dog, dogs, dog’s, dogs’ => dog

The result of this mapping applied on a text will be something like that:

  • the boy’s dogs are different sizes => the boy dog be differ size

Stemming and lemmatization are special cases of normalization. However, they are different from each other.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Source: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The difference is that a stemmer operates without knowledge of the context, and therefore cannot distinguish between words which have different meanings depending on the part of speech. However, stemmers also have some advantages: they are easier to implement and usually run faster. Also, the reduced “accuracy” may not matter for some applications.

Examples:

  1. The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
  2. The word “play” is the base form for the word “playing”, and hence this is matched in both stemming and lemmatization.
  3. The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context; e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.

Now that we know the difference, let’s see some examples using the NLTK tool.
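Here is a minimal sketch using NLTK’s PorterStemmer and WordNetLemmatizer (passing pos="v" so the lemmatizer treats the words as verbs); it produces the output below:

from nltk.stem import PorterStemmer, WordNetLemmatizer

# the lemmatizer needs the WordNet data: nltk.download("wordnet")
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["seen", "drove"]:
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos="v"))
    print()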

Output:

Stemmer: seen
Lemmatizer: see

Stemmer: drove
Lemmatizer: drive

4. Stop Words

Source: http://www.nepalinlp.com/detail/stop-words-removal_nepali/

Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise. That’s why we want to remove these irrelevant words.

Stop words usually refer to the most common words such as “and”, “the”, “a” in a language, but there is no single universal list of stopwords. The list of the stop words can change depending on your application.

NLTK has a predefined list of stop words that covers the most common words. If you’re using it for the first time, you need to download the stop words with nltk.download("stopwords"). Once the download completes, we can load the stopwords corpus from nltk.corpus and use it to get the stop words.
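For example, a minimal sketch that loads and prints the list of English stop words:

from nltk.corpus import stopwords

# requires nltk.download("stopwords") to have been run once
stop_words = stopwords.words("english")
print(stop_words)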

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Let’s see how we can remove the stop words from a sentence.
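A small sketch using the first backgammon sentence: tokenize it, then keep only the tokens that are not in the stop word list.

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = [word for word in words if word.lower() not in stop_words]
print(without_stop_words)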

Output:

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']

If you’re not familiar with list comprehensions in Python, here is another way to achieve the same result.
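Roughly, the same filtering written with a plain for loop (reusing the words list and the stop_words set from the example above):

without_stop_words = []
for word in words:
    if word.lower() not in stop_words:
        without_stop_words.append(word)
print(without_stop_words)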

However, keep in mind that list comprehensions are faster because they are optimized for the Python interpreter to spot a predictable pattern during looping.

You might wonder why we convert our list into a set. A set is an abstract data type that stores unique values without any particular order. Searching in a set is much faster than searching in a list. For a small number of words there is no big difference, but if you have a large number of words it’s highly recommended to use the set type.

If you want to learn more about the time complexity of the different operations for the different data structures, you can look at this awesome cheat sheet.

5. Regex

Source: https://digitalfortress.tech/tricks/top-15-commonly-used-regex/

A regular expression, regex, or regexp is a sequence of characters that define a search pattern. Let’s see some basics.

  • . - match any character except a newline
  • \w - match a word character
  • \d - match a digit
  • \s - match a whitespace character
  • \W - match a non-word character
  • \D - match a non-digit character
  • \S - match a non-whitespace character
  • [abc] - match any of a, b, or c
  • [^abc] - match any character except a, b, or c
  • [a-g] - match a character between a and g

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually, patterns will be expressed in Python code using this raw string notation.

Source: https://docs.python.org/3/library/re.html?highlight=regex

We can use regex to apply additional filtering to our text. For example, we can remove all the non-word characters. In many cases, we don’t need the punctuation marks, and it’s easy to remove them with regex.

In Python, the re module provides regular expression matching operations similar to those in Perl. We can use the re.sub function to replace the matches for a pattern with a replacement string. Let’s see an example where we replace all non-word characters with a space.
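A minimal sketch that produces the output below:

import re

sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
pattern = r"[^\w]"   # any character that is not a word character
print(re.sub(pattern, " ", sentence))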

Output:

'The development of snowboarding was inspired by skateboarding  sledding  surfing and skiing '

A regular expression is a powerful tool and we can create much more complex patterns. If you want to learn more about regex I can recommend you to try these 2 web apps: regexr, regex101.

6. Bag-of-Words


Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers. This is called feature extraction.

The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.

To use this model, we need to:

  1. Design a vocabulary of known words (also called tokens)
  2. Choose a measure of the presence of known words

Any information about the order or structure of the words is discarded. That’s why it’s called a bag of words. The model only captures whether a known word occurs in a document, not where it occurs in the document.

The intuition is that similar documents have similar content. Also, from the content alone, we can learn something about the meaning of the document.

Example

Let’s walk through the steps to create a bag-of-words model. In this example, we’ll use only four sentences to see how the model works. In real-world problems, you’ll work with much larger amounts of data.

1. Load the Data


Let’s say that this is our data, one short review per line, and we want to load it into a list of strings:

I like this movie, it's funny.
I hate this movie.
This was awesome! I like it.
Nice one. I love it.

To achieve this we can simply read the file and split it by lines.
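A sketch of that step (the file name is just an assumption; any plain-text file with one review per line will do):

with open("reviews.txt", "r") as file:   # hypothetical file with one review per line
    documents = [line.strip() for line in file if line.strip()]
print(documents)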

Output:

["I like this movie, it's funny.", 'I hate this movie.', 'This was awesome! I like it.', 'Nice one. I love it.']

2. Design the Vocabulary


Let’s get all the unique words from the four loaded sentences, ignoring the case, punctuation, and one-character tokens. These words will be our vocabulary (known words): awesome, funny, hate, it, like, love, movie, nice, one, this, was.

We can use the CountVectorizer class from the scikit-learn library to design our vocabulary. We’ll see how to use it right after the next step.

3. Create the Document Vectors


Next, we need to score the words in each document. The task here is to convert each raw text into a vector of numbers. After that, we can use these vectors as input for a machine learning model. The simplest scoring method is to mark the presence of each word with 1 if it is present and 0 if it is absent.

Now, let’s see how we can create a bag-of-words model using the CountVectorizer class mentioned above.
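Here is a minimal sketch of that step (it assumes scikit-learn and pandas are installed; binary=True implements the simple presence/absence scoring described above, and with these four sentences plain counts would give the same matrix):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it.",
             "Nice one. I love it."]

# Learn the vocabulary and build one document vector per sentence.
vectorizer = CountVectorizer(binary=True)
bag_of_words = vectorizer.fit_transform(documents)

# In older scikit-learn versions this method is called get_feature_names().
feature_names = vectorizer.get_feature_names_out()
print(pd.DataFrame(bag_of_words.toarray(), columns=feature_names))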

Output:

Here are our four sentences again, followed by the document vectors the model produces:

["I like this movie, it's funny.", 'I hate this movie.', 'This was awesome! I like it.', 'Nice one. I love it.']

   awesome  funny  hate  it  like  love  movie  nice  one  this  was
0        0      1     0   1     1     0      1     0    0     1    0
1        0      0     1   0     0     0      1     0    0     1    0
2        1      0     0   1     1     0      0     0    0     1    1
3        0      0     0   1     0     1      0     1    1     0    0

Each row is the vector for one sentence and each column corresponds to a word from the vocabulary; a 1 means that word occurs in that sentence.

Additional Notes on the Bag of Words Model


The complexity of the bag-of-words model comes in deciding how to design the vocabulary of known words (tokens) and how to score the presence of known words.

Designing the Vocabulary
When the vocabulary size increases, the vector representation of the documents also increases. In the example above, the length of the document vector is equal to the number of known words.

In some cases, we can have a huge amount of data, and in these cases the length of the vector that represents a document might be thousands or millions of elements. Furthermore, each document may contain only a few of the known words in the vocabulary.

Therefore the vector representations will have a lot of zeros. These vectors which have a lot of zeros are called sparse vectors. They require more memory and computational resources.

We can decrease the number of known words when using a bag-of-words model to reduce the required memory and computational resources. We can use the text cleaning techniques we’ve already seen in this article before we create our bag-of-words model:

  • Ignoring the case of the words
  • Ignoring punctuation
  • Removing the stop words from our documents
  • Reducing the words to their base form (Text Lemmatization and Stemming)
  • Fixing misspelled words

Another more complex way to create a vocabulary is to use grouped words. This changes the scope of the vocabulary and allows the bag-of-words model to get more details about the document. This approach is called n-grams.

An n-gram is a sequence of a number of items (words, letters, numbers, digits, etc.). In the context of text corpora, n-grams typically refer to sequences of words. A unigram is one word, a bigram is a sequence of two words, a trigram is a sequence of three words, and so on. The “n” in “n-gram” refers to the number of grouped words. Only the n-grams that appear in the corpus are modeled, not all possible n-grams.

Example
Let’s look at all the bigrams for the following sentence:
The office building is open today

All the bigrams are:

  • the office
  • office building
  • building is
  • is open
  • open today
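Roughly, the same bigrams can be extracted with NLTK’s ngrams helper (lowercasing the sentence first so the tokens match the list above):

import nltk

sentence = "The office building is open today"
tokens = nltk.word_tokenize(sentence.lower())
print(list(nltk.ngrams(tokens, 2)))
# [('the', 'office'), ('office', 'building'), ('building', 'is'), ('is', 'open'), ('open', 'today')]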

The bag-of-bigrams approach is often more powerful than the plain bag-of-words approach because it captures some of the local word order.

Scoring Words
Once we have created our vocabulary of known words, we need to score the occurrence of the words in our data. We saw one very simple approach: the binary approach (1 for presence, 0 for absence).

Some additional scoring methods are:

  • Counts. Count the number of times each word appears in a document.
  • Frequencies. Calculate the frequency with which each word appears in a document out of all the words in that document.

7. TF-IDF

One problem with scoring word frequency is that the most frequent words in the document get the highest scores. These frequent words may not carry as much “informational gain” for the model as rarer, more domain-specific words. One way to fix that problem is to penalize words that are frequent across all the documents. This approach is called TF-IDF.

TF-IDF, short for term frequency-inverse document frequency, is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

The TF-IDF scoring value increases proportionally to the number of times a word appears in the document, but it is offset by the number of documents in the corpus that contain the word.

Let’s see the formula used to calculate a TF-IDF score for a given term x within a document y:

w(x, y) = tf(x, y) * log(N / df(x))

Source: http://filotechnologia.blogspot.com/2014/01/a-simple-java-class-for-tfidf-scoring.html

Now, let’s split this formula a little bit and see how the different parts of the formula work.

  • Term Frequency (TF): tf(x, y) is a scoring of how frequently the term x appears in the current document y.
  • Inverse Document Frequency (IDF): log(N / df(x)) is a scoring of how rare the term is across documents, where N is the total number of documents in the corpus and df(x) is the number of documents that contain the term x.
  • Finally, the TF-IDF score for a given term is simply the product of the two parts: the term frequency multiplied by the inverse document frequency.

Example
In Python, we can use the TfidfVectorizer class from the sklearn library to calculate the TF-IDF scores for given documents. Let’s use the same sentences that we have used with the bag-of-words example.
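Here is a minimal sketch (note that scikit-learn uses a smoothed variant of the IDF formula shown above and normalizes the resulting vectors, so the exact numbers differ slightly from the plain formula):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it.",
             "Nice one. I love it."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

# In older scikit-learn versions this method is called get_feature_names().
feature_names = vectorizer.get_feature_names_out()
print(pd.DataFrame(tfidf.toarray(), columns=feature_names).round(2))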

Output:

Again, each row of the output is the TF-IDF vector for one of the four sentences. Comparing it with the bag-of-words vectors above, you can see that words like “this” and “it”, which appear in most of the sentences, receive lower weights than rarer words such as “awesome” or “funny”.

Summary

In this blog post, you learned the basics of NLP for text. More specifically, you have learned the following concepts:

  • NLP is used to apply machine learning algorithms to text and speech.
  • NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.
  • Sentence tokenization is the problem of dividing a string of written language into its component sentences.
  • Word tokenization is the problem of dividing a string of written language into its component words.
  • The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
  • Stop words are words which are filtered out before or after processing of text. They usually refer to the most common words in a language.
  • A regular expression is a sequence of characters that define a search pattern.
  • The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.
  • TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

Awesome! Now we know the basics of how to extract features from text. We can then use these features as input for machine learning algorithms.

Do you want to see all the concepts used in one more big example?
- Here you are!

Resources

Interactive Version

Here is an interactive version of this article uploaded to Deepnote (a cloud-hosted Jupyter notebook platform). Feel free to check it out and play with the examples.

Other Blog Posts by Me

You can also check my previous blog posts.

Newsletter

If you want to be notified when I post a new blog post you can subscribe to my fresh newsletter.

LinkedIn

Here is my LinkedIn profile in case you want to connect with me. I’ll be happy to be connected with you.

Final Words

Thank you for reading. I hope that you have enjoyed the article. If you like it, please hold the clap button and share it with your friends. I’ll be happy to hear your feedback. If you have any questions, feel free to ask them. 😉
