
A Natural Language Processing (NLP) Primer

Overview of common NLP tasks using Python

Photo by Skylar Kang from Pexels

Natural Language Processing (NLP) is used to analyse textual data. This could be data from websites, scanned documents, books, journals, Tweets or YouTube comments, to name but a few sources.

This primer introduces some of the common NLP tasks that can be carried out using Python. The examples mostly use the Natural Language Toolkit (NLTK) and scikit-learn packages. It is assumed that you have a basic working knowledge of Python and data science principles.

Natural language refers to languages like English, French, Arabic and Chinese, as opposed to computer languages like Python, R and C++. NLP automates parts of the analysis of textual data that were previously only possible with qualitative methods. Qualitative methods like framework/thematic analysis do not scale to large quantities of textual data, and this is where NLP comes in. It is also used to build things like chatbots and digital assistants (e.g. Siri and Alexa), among other applications.

The data used in this notebook is derived from a sub-sample of the Coronavirus Corpus (https://www.english-corpora.org/corona/) [1]. The data is about the coronavirus pandemic, represents a sub-set of various media sources (e.g. newspapers, websites) for the period January to May 2020, and contains around 3.2 million English language words.


Loading text data with Python

Let's start by taking a look at the text data to see what it looks like. We can use the operating system (os) module to list all of the files in our text folder. In this case the files are located in a folder called text inside a folder called NLP, relative to the Python source file (e.g. a Jupyter notebook or .py file), which is expressed with a "./".

import os
os.listdir("./NLP/text")

Which produces a list of 5 text files:

>>> ['20-01.txt', '20-02.txt', '20-03.txt', '20-04.txt', '20-05.txt']

As you can see there are 5 text files (*.txt). We can use the standard Python file handling functions to open a file. In this case we will limit the output to the first 10 lines to get a look at how the file is structured and what sort of information it contains (shortened for brevity below). The next function is used to step through the lines in the file.

with open("./NLP/text/20–01.txt") as txt_file:
    head = [next(txt_file) for i in range(10)]
print(head)
>>> ['\n', '@@31553641 <p> The government last July called the energy sector debt situation a " state of emergency . " <p> This was during the mid-year budget review during which the Finance Minister Ken Ofori-Atta castigated the previous NDC government for entering into " obnoxious take-or-pay contracts signed by the NDC , which obligate us to pay for capacity we do not need . " <p> The government pays over GH ? 2.5 billion annually for some 2,300MW in installed capacity which the country does not consume . <p> He sounded alarmed that " from 2020 if nothing is done , we will be facing annual excess gas capacity charges of between $550 and $850 million every year . <p> JoyNews ' Business editor George Wiafe said the latest IMF Staff report expressing fears over a possible classification is " more of a warning " to government . <p> He said the latest assessment raises concerns about the purpose of government borrowings , whether it goes into consumption or into projects capable of generating revenue to pay back the loan . <p> The move could increase the country 's risk profile and ability to borrow on the international market . <p> @ @ @ @ @ @ @ @ @ @ issue another Eurobond in 2020 . <p> The Finance Minister Ken Ofori-Atta wants to return to the Eurobond market to raise $3bn to pay for expenditure items the country can not fund from domestic sources . <p> The government wants to spend GH ? 86m in 2020 but is projecting to raise only GH ? 67bn . It leaves a deficit of GH ? 19bn , monies that the Eurobond could make available . <p> The planned return to the Eurobond market is the seventh time in the past eight years . <p> Ghana is already among 10 low-income countries ( LICs ) in Africa that were at high risk of debt distress . <p> The country in April 2019 , successfully completion an Extended Credit Facility ( ECF ) programme , or bailout , of the International Monetary fund ( IMF ) . \n']

Text representation

To store and use characters digitally, they are represented with an encoding system. There are different character encoding standards such as ASCII (American Standard Code for Information Interchange). For example, the letter 'a' is represented by the ASCII code 97, which is 01100001 in binary. There are other encodings such as UTF-8 (Unicode Transformation Format, 8-bit), which encodes characters using a variable number of bytes and can represent the full Unicode character set. The letter 'a' corresponds to the Unicode code point U+0061. In Python, we can write this code point directly as an escape sequence:

u"u0061"
>>> 'a'
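
As a quick check of those values, Python's built-in ord and chr functions convert between a character and its code point, and the string encode method shows the bytes a given encoding uses:

ord("a")
>>> 97
chr(97)
>>> 'a'
"a".encode("utf-8")
>>> b'a'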

You may need to convert the character encoding of text data that you import to carry out further processing and to better represent certain symbols (e.g. emojis 🌝). You can check which default encoding you are using with the getdefaultencoding function in the sys module. In Python 3 this default is UTF-8 and, unlike in Python 2, it cannot be changed at runtime (the old sys.setdefaultencoding function no longer exists); instead, you specify an encoding explicitly when reading or writing files.

import sys
sys.getdefaultencoding()
>>> 'utf-8'
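
In practice, rather than changing any global default, you would normally state the encoding explicitly when opening a file. A minimal sketch, reusing one of the text files from earlier (the errors="replace" option is just one way of handling bytes that cannot be decoded):

# read the file as UTF-8, substituting any bytes that cannot be decoded
with open("./NLP/text/20-01.txt", encoding="utf-8", errors="replace") as txt_file:
    first_line = txt_file.readline()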

Data pre-processing

There are several stages to data processing when applying NLP. These differ depending on the exact context but usually follow a similar path to that shown in the image below. This generally involves accessing textual data, whether in the form of web pages, Tweets, comments, PDF documents or raw text formats. The text is then broken down into a representation that algorithms can readily work with (e.g. tokens representing individual words or letters), and common high-frequency words (stop words, e.g. "and", "or", "the") are removed. Further normalisation is followed by tasks like feature extraction and/or noise removal. Finally, the various models and methods (e.g. topic modelling, sentiment analysis, neural networks) are applied.

A common pre-processing approach to text data (image by author)

Since we already have some text to work with, we can look at the next steps involved in pre-processing the data, starting with tokenisation. Tokenisation can be done with many Python libraries, including the machine learning library scikit-learn. A popular library for NLP tasks is the Natural Language Toolkit (NLTK):

import nltk

Note: Another powerful alternative library for NLP in Python is spaCy.

Tokenising

Even within libraries there are often different tokenisers to choose from. For example, NLTK also has a RegexpTokenizer (which uses regular expressions). Here we will use the TreebankWordTokenizer as it filters out some punctuation and whitespace.

from nltk.tokenize import TreebankWordTokenizer

We will take a short snippet of text from our text data to illustrate how this works. Here we store this as a string (a data type used to store textual data in Python) in a variable called txt.

txt = "The government last July called the energy sector debt situation a state of emergency. <p> This was during the mid-year budget review during which the Finance Minister Ken Ofori-Atta castigated the previous NDC government for entering into obnoxious take-or-pay contracts signed by the NDC , which obligate us to pay for capacity we do not need . <p> The government pays over GH ? 2.5 billion annually for some 2,300MW in installed capacity which the country does not consume ."

A simple first step might be to convert all of this text into lower case. We can do this with the string's lower method.

txt = txt.lower()
txt
>>> 'the government last july called the energy sector debt situation a state of emergency. <p> this was during the mid-year budget review during which the finance minister ken ofori-atta castigated the previous ndc government for entering into obnoxious take-or-pay contracts signed by the ndc , which obligate us to pay for capacity we do not need .  <p> the government pays over gh ? 2.5 billion annually for some 2,300mw in installed capacity which the country does not consume .'

Next we can create an instance of the TreebankWordTokenizer class and use the tokenize function, passing in our txt variable. The output can be seen below (showing first 20).

tk = TreebankWordTokenizer()
tk_words = tk.tokenize(txt)
tk_words[:20]
>>> ['the',
     'government',
     'last',
     'july',
     'called',
     'the',
     'energy',
     'sector',
     'debt',
     'situation',
     'a',
     'state',
     'of',
     'emergency.',
     '<',
     'p',
     '>',
     'this',
     'was',
     'during']

The casual_tokenize function is useful for tokenising social media text as it deals well with things like usernames and emojis. The TweetTokenizer also keeps hashtags intact for Twitter analysis.
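
For example, using a made-up tweet, the TweetTokenizer keeps the handle, hashtag and emoji as single tokens (output along these lines):

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
tt.tokenize("@WHO says wash your hands! #COVID19 😷")
>>> ['@WHO', 'says', 'wash', 'your', 'hands', '!', '#COVID19', '😷']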

Dealing with stop words and punctuation

The next thing that is common to do is to remove stop words. These are common high frequency words that are used in sentence structure but don’t carry that much meaning for analysis purposes. These include words like "the", "and", "to", "a" etc. We can download a list of these words from the NLTK library and store them in a variable (sw) like so:

nltk.download("stopwords", quiet=True)
sw = nltk.corpus.stopwords.words("english")

We can take a look at the first 20 of these words.

sw[:20]
>>> ['i',
     'me',
     'my',
     'myself',
     'we',
     'our',
     'ours',
     'ourselves',
     'you',
     "you're",
     "you've",
     "you'll",
     "you'd",
     'your',
     'yours',
     'yourself',
     'yourselves',
     'he',
     'him',
     'his']

There are 179 stop words in the list at the time of writing. You can use the len function to see how many words are in the list if you are interested (e.g. len(sw)).

Let’s now remove those stop words from our tokenised words. We can use a Python list comprehension to filter words in the tk_words list that are not in the stop words (sw) list.

tk_words_filtered_sw = [word for word in tk_words if word not in sw]

If you are not familiar with list comprehensions, they are essentially a concise way of creating a list while iterating over a data structure, often avoiding the extra lines of code needed for a separate "for loop". Let's say I want to square the numbers 0 to 4. We could use a for loop like so:

squared_nums = []
for n in range(5):
    squared_nums.append(n**2)

Although this has the desired effect, we could instead create the list and square the numbers in a single line of code using a list comprehension:

squared_nums = [n**2 for n in range(5)]

If we output the contents of the tk_words_filtered_sw variable we can see a lot of these stop words have now been removed (abbreviated for brevity):

tk_words_filtered_sw
>>> ['government',
     'last',
     'july',
     'called',
     'energy',
     'sector',
     'debt',
     'situation',
     'state',
     'emergency.',
     '<',
     'p',
     '>',
     'mid-year',
     'budget',
     'review',
     'finance',
     'minister',
     'ken',
     'ofori-atta',
     'castigated',
     'previous',
     'ndc',
     'government',
     'entering',
     'obnoxious',
     'take-or-pay',
     'contracts',
     'signed',
     'ndc',
     ','
...

You can see that there are still punctuation symbols like full stops (periods), question marks and commas in the text. Again, we would typically remove these from the list, and this can be done in a number of different ways. Here we use the string module to remove the punctuation, storing the result in a variable called no_punc.

import string
no_punc = ["".join(j for j in i if j not in string.punctuation) for i in tk_words_filtered_sw]
no_punc
>>> ['government',
     'last',
     'july',
     'called',
     'energy',
     'sector',
     'debt',
     'situation',
     'state',
     'emergency',
     '',
     'p',
     '',
     'midyear',
     'budget'
...

We can then remove these empty strings ('') from the list by filtering them out, storing the result in a variable called filtered_punc.

filtered_punc = list(filter(None, no_punc))
filtered_punc
>>> ['government',
     'last',
     'july',
     'called',
     'energy',
     'sector',
     'debt',
     'situation',
     'state',
     'emergency',
     'p',
     'midyear',
     'budget'
...

Finally, we might also want to remove numbers from the list. To do this we can check whether the strings contain any digits with the isdigit string method.

str_list = [i for i in filtered_punc if not any(j.isdigit() for j in i)]

As you can see, text data can be very messy and requires a lot of cleaning before you can start to run various analyses. Another common way to reduce the number of distinct word forms is to use either stemming or lemmatisation.

Stemming and lemmatization

Stemming refers to reducing words down to their root (stem) forms. For example the words "waited", "waits" and "waiting" can be reduced to "wait". We can take a look at an example of a common stemmer called the Porter stemmer. First we can create a short list of words to demonstrate how this works. This stage typically follows the tokenisation of words.

word_list = ["flying", "flies", "waiting", "waits", "waited", "ball", "balls", "flyer"]

Next we will import the Porter stemmer from NLTK.

from nltk.stem.porter import PorterStemmer

Make an instance of the PorterStemmer class called ps.

ps = PorterStemmer()

Finally use a list comprehension to stem each word in the list. The result of which can be seen below.

stem_words = [ps.stem(word) for word in word_list]
stem_words
>>> ['fli', 'fli', 'wait', 'wait', 'wait', 'ball', 'ball', 'flyer']

It is possible for over-stemming to occur, where words that should have different stems are reduced to the same root. It is also possible to get under-stemming, which is essentially the opposite (words that should be stemmed to the same root are not). There are various stemmers that can be used, such as the Porter, Snowball (English), Paice/Husk and Lovins stemmers, to name a few. The Porter stemmer is one of the most widely used.

Some stemmers are "harsher" or more "gentle" than others so you may need to try different stemmers to get the desired results. You may also decide to apply stemming before or after stop word removal depending on the output.
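
For example, we could compare the Porter stemmer with the more aggressive Lancaster (Paice/Husk) stemmer that NLTK also provides and inspect how its output differs from stem_words above (a minimal sketch):

from nltk.stem import LancasterStemmer
ls = LancasterStemmer()
lancaster_words = [ls.stem(word) for word in word_list]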

Lemmatisation, on the other hand, works by identifying the meaning of a word in relation to the sentence it appears in; context matters for lemmatisation. The root word in this case is called a lemma. For example, the word "better" can be reduced to the lemma "good", the word it is derived from. This process is more computationally expensive than stemming.

We need to first download the wordnet resource from NLTK (a large lexical database of English nouns, adjectives, adverbs and verbs).

nltk.download('wordnet', quiet=True)

We will then import and create an instance of the WordNetLemmatizer and apply it to the word list we used previously.

word_list
>>> ['flying', 'flies', 'waiting', 'waits', 'waited', 'ball', 'balls', 'flyer']
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
lem_words = [lm.lemmatize(word) for word in word_list]
lem_words
>>> ['flying', 'fly', 'waiting', 'wait', 'waited', 'ball', 'ball', 'flyer']

If we compare lem_words to stem_words (below), you can see that although the results are similar or identical in some cases, for words like "flying" and "flies" the meaning is preserved by lemmatisation, whereas it is lost with stemming.

stem_words
>>> ['fli', 'fli', 'wait', 'wait', 'wait', 'ball', 'ball', 'flyer']
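
By default the WordNet lemmatiser treats every word as a noun. Supplying a part-of-speech hint gives it the context it needs, which is how the "better" to "good" example mentioned earlier works ('a' stands for adjective and 'v' for verb):

lm.lemmatize("better", pos="a")
>>> 'good'
lm.lemmatize("flying", pos="v")
>>> 'fly'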

Normalisation/scaling

An alternative to removing features (words/terms) is to rescale the data using tf-idf, which stands for term frequency, inverse document frequency.

tf-idf equation (image by author)

This is used to see how important a term (word) is to an individual document within a collection of documents (corpus), and we essentially use it as a weighting. The term frequency is the frequency of occurrence of a word/term in a document. For example, suppose we have two documents, d1 and d2, that look like this:

𝑑1 = "The small boy is in the house." 𝑑2 = "The small boy is not in the house."

If the term of interest is "boy" in document 1 (d1), it occurs once out of seven words, giving a term frequency of 1/7 ≈ 0.1428. We can do the same for each term in each document. Once we have calculated the term frequency, we multiply it by the logarithm of the total number of documents in the corpus divided by the number of documents containing the term. This tells us whether a particular term is more, less or equally relevant to a document. In the example above, the word "not" in document two is important in distinguishing the two documents. High weightings come from a high term frequency in a document combined with a low frequency of the term across the corpus (collection of documents). TF-IDF can also be used for text summarisation tasks.
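
As a quick worked version of this example, here is a minimal sketch in plain Python (ignoring case and punctuation for simplicity, and using a base-10 logarithm; library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing and normalisation):

import math

d1 = "the small boy is in the house"
d2 = "the small boy is not in the house"
docs = [d1, d2]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)          # relative frequency within the document

def idf(term, docs):
    n_containing = sum(1 for doc in docs if term in doc.split())
    return math.log10(len(docs) / n_containing)    # log of (total docs / docs containing the term)

# "boy" appears in both documents, so its idf (and hence its weight) is zero
[round(tf("boy", d) * idf("boy", docs), 4) for d in docs]
>>> [0.0, 0.0]

# "not" only appears in d2, so it gets a non-zero weight there
[round(tf("not", d) * idf("not", docs), 4) for d in docs]
>>> [0.0, 0.0376]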

To use TF-IDF in practice we are going to need the text files we saw earlier. To make this work a bit better, I saved a version without stop words and HTML paragraph tags, converted to lowercase. Essentially, the code below opens the files one by one in a loop, reads each one in, strips out the HTML tags with the BeautifulSoup library, splits the text into words and writes every word that is not a stop word to a new file.

from bs4 import BeautifulSoup

file_path = "./NLP/text/"
file_list = ['20-01', '20-02', '20-03', '20-04']

for file in file_list:
    # read the raw text and convert to lowercase
    with open(file_path + file + ".txt") as current_file:
        text = current_file.read().lower()
    # strip out the HTML tags and split into individual words
    words = BeautifulSoup(text, "html.parser").get_text().split()
    # write the non stop words to a new file, separated by spaces
    with open(file_path + file + "-f.txt", "w") as formatted_file:
        formatted_file.write(" ".join(word for word in words if word not in sw))

We can now load these files and store their contents in variables.

path = "./NLP/text/processed/"
txt_file_1 = open(path + "20–01-f.txt") 
file_1 = txt_file_1.read()
txt_file_2 = open(path + "20–02-f.txt") 
file_2 = txt_file_2.read()
txt_file_3 = open(path + "20–03-f.txt")
file_3 = txt_file_3.read()
txt_file_4 = open(path + "20–04-f.txt") 
file_4 = txt_file_4.read()

We can place the text data extracted from the files in a list to make it easier to work with.

data_files = [file_1, file_2, file_3, file_4]

Next we will import the pandas library, which is widely used in data science and provides functionality to represent data in a tabular structure (a data frame). Then, from the scikit-learn machine learning library, we will import the TfidfVectorizer, which will tokenise the documents and apply the IDF weightings.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Finally we can create an instance of the class and fit it to the data. We can display the results in a data frame with each feature and its associated TF-IDF weight, sorted in descending order. For the sake of brevity, we will just output the first 30 terms.

tv = TfidfVectorizer(use_idf=True)
tfIdf = tv.fit_transform(data_files)
df = pd.DataFrame(tfIdf[0].T.todense(), index=tv.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
df.head(30)
Output of Pandas data frame - shortened for brevity (image by author)

Word frequencies

One of the simplest things we can do with word data is to look at the frequency of occurrence of each unique word in a document (and/or its cumulative frequency over a corpus). We can use NLTK's FreqDist to compute a dictionary-like structure of key/value pairs containing each term (word) and the number of times it appears in the text. For example, the word "government" appears 3 times in the original short sample text we used earlier.

dist = nltk.FreqDist(str_list)
dist
>>> FreqDist({'government': 3, 'p': 2, 'ndc': 2, 'capacity': 2, 'last': 1, 'july': 1, 'called': 1, 'energy': 1, 'sector': 1, 'debt': 1, ...})

To make this a bit easier to visualise we can output this as a plot. As you can see, the word "government" occurs 3 times, a few terms such as "p" (left over from the HTML paragraph tags), "ndc" and "capacity" occur twice, and the remaining words occur once.

dist.plot();
Plot of words and frequency of occurrence (image by author)

Another visual way of displaying word frequencies is a word cloud, where the larger a word appears, the more often it occurs. To do this we can use the WordCloud library.

from wordcloud import WordCloud

We also need pyplot (imported as plt) from the matplotlib library, which is used for various visualisations. The %matplotlib inline magic command ensures that, when used with a front end like a Jupyter notebook, the plot appears below the code cell and is stored in the notebook.

import matplotlib.pyplot as plt
%matplotlib inline

We next need to get the data into the correct format for the word cloud. The generate function accepts a string, so we will collapse the list of tokens back into a single string, with spaces between words, using Python's built-in join method for string concatenation (joining strings together).

flattend_text = " ".join(str_list)

Next we can create the word cloud using the generate function passing in the string of text.

wc = WordCloud().generate(flattend_text)

Finally we will display the word cloud, turning off the axis text and using the bilinear interpolation option to smooth the appearance of the image.

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word cloud output (image by author)

There are other optional parameters you can use. Popular ones include setting the maximum font size (max_font_size), the maximum number of words to include (max_words, helpful if you have many words) and the background colour (background_color). For example, we can set the maximum number of words to display to 10 and make the background white.

wc_2 = WordCloud(max_words = 10, background_color = "white").generate(flattend_text)
plt.imshow(wc_2, interpolation='bilinear')
plt.axis("off")
plt.show()
Word cloud output (image by author)

N-gram analysis

When we tokenise words and represent them as a bag of words, we lose some of the context and meaning. Single words alone don't tell us that much, but the frequency at which they occur alongside other words might. For example, the words "information" and "governance" might often be seen together and have a specific meaning. We can account for this using n-grams, which refers to a number of tokens appearing together. The tokens can be words or letters; here we will use words. Two words (n = 2) are called bi-grams, three words are tri-grams and so on. Using n-grams helps us to retain some of the meaning/context in the text.

We will use the same short text snippet from earlier:

txt
>>> 'the government last july called the energy sector debt situation a state of emergency. <p> this was during the mid-year budget review during which the finance minister ken ofori-atta castigated the previous ndc government for entering into obnoxious take-or-pay contracts signed by the ndc , which obligate us to pay for capacity we do not need .  <p> the government pays over gh ? 2.5 billion annually for some 2,300mw in installed capacity which the country does not consume .'

Next we can import the ngrams function from the NLTK utilities package.

from nltk.util import ngrams

We will tokenise the text again using the same tokeniser as before.

tk = TreebankWordTokenizer()
tk_words = tk.tokenize(txt)

Finally we pass this tokenised list into the ngrams function and specify n, in this case 2 for bi-grams (output abbreviated for brevity).

bigrams = list(ngrams(tk_words, 2))
bigrams
>>> [('the', 'government'),
     ('government', 'last'),
     ('last', 'july'),
     ('july', 'called'),
     ('called', 'the'),
     ('the', 'energy'),
     ('energy', 'sector'),
     ('sector', 'debt'),
     ('debt', 'situation'),
     ('situation', 'a'),
     ('a', 'state'),
     ('state', 'of'),
     ('of', 'emergency.'),
     ('emergency.', '<'),
     ('<', 'p'),
     ('p', '>'),
     ('>', 'this'),
     ('this', 'was'),
     ('was', 'during'),
     ('during', 'the'),
     ('the', 'mid-year'),
     ('mid-year', 'budget')
...

Further down the full list we can see pairs like ("contracts", "signed") and ("government", "pays"), which provide much more context than the individual words alone. You can also filter out n-grams from text data as part of the pre-processing.
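
Tri-grams work in exactly the same way; we simply set n to 3 (first five shown):

trigrams = list(ngrams(tk_words, 3))
trigrams[:5]
>>> [('the', 'government', 'last'),
     ('government', 'last', 'july'),
     ('last', 'july', 'called'),
     ('july', 'called', 'the'),
     ('called', 'the', 'energy')]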

We can also count how many times each bigram occurs using the BigramCollocationFinder class. Here we sort the resulting list in descending order of frequency.

from nltk.collocations import BigramCollocationFinder
finder = BigramCollocationFinder.from_words(tk_words, window_size=2)
ngram = list(finder.ngram_fd.items())
ngram.sort(key=lambda item: item[-1], reverse=True)
ngram
>>> [(('the', 'government'), 2),
     (('<', 'p'), 2),
     (('p', '>'), 2),
     (('which', 'the'), 2),
     (('government', 'last'), 1),
     (('last', 'july'), 1),
     (('july', 'called'), 1),
     (('called', 'the'), 1),
     (('the', 'energy'), 1),
     (('energy', 'sector'), 1),
     (('sector', 'debt'), 1),
     (('debt', 'situation'), 1),
     (('situation', 'a'), 1),
     (('a', 'state'), 1),
     (('state', 'of'), 1),
     (('of', 'emergency.'), 1),
     (('emergency.', '<'), 1),
     (('>', 'this'), 1),
     (('this', 'was'), 1),
     (('was', 'during'), 1),
     (('during', 'the'), 1),
     (('the', 'mid-year'), 1),
     (('mid-year', 'budget'), 1)
...

Sentiment analysis

Sentiment analysis involves analysing text to determine how 'positive' or 'negative' it is. This can give us information about people's opinions/emotions. It can be applied to things like product reviews, to get an overall sense of whether a product is seen in a good light, or used for research purposes, for example to ask whether people are mostly positive or negative about mask wearing in their Tweets. To do this we can train a model or use an existing lexicon. The VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon is implemented in NLTK. It produces scores for the proportions of positive, negative and neutral sentiment in the text (each between 0 and 1), along with a normalised compound score between -1 and +1. To start, we will download the lexicon. A lexicon contains information (e.g. semantics or grammar) related to individual words or strings of text.

nltk.download("vader_lexicon", quiet=True)

We import the sentiment analyser class SentimentIntensityAnalyzer and create an instance called snt (for sentiment).

from nltk.sentiment.vader import SentimentIntensityAnalyzer
snt = SentimentIntensityAnalyzer()

We can then pass in some text data (e.g. the second of our files) to the function polarity_scores. You can see the scores returned below as key/value pairs in a dictionary data structure.

snt.polarity_scores(data_files[1])
>>> {'neg': 0.101, 'neu': 0.782, 'pos': 0.117, 'compound': 1.0}

In this example, looking at the second text file, we see a predominantly neutral score (0.782), followed by a positive score (0.117) and a negative score (0.101). You might compare sentiment at different points in time, or between different sub-groups, to see how it varies. Another metric that is sometimes used is the Net Sentiment Score (NSS), calculated by subtracting the negative score from the positive one. First we need to store the scores in a variable to access them.

sent_scores = snt.polarity_scores(data_files[1])

Then we can subtract the negative score from the positive one.

nss = sent_scores['pos'] - sent_scores['neg']
print("NSS =", nss)
>>> NSS = 0.016

Topic modelling

Topic modelling, often used for text mining, allows us to build a statistical model to discover topics in textual data. It is an unsupervised method and there are various approaches/algorithms; we will look at a couple here. The first is Latent Semantic Analysis (LSA), which works in a similar way to Principal Component Analysis (PCA). LSA is a linear model and assumes a roughly normal distribution of terms across documents. It uses SVD (Singular Value Decomposition), which is computationally expensive, and the method helps to reduce noise in the data.

Note: SVD factorises the document-term matrix into three matrices (two orthogonal matrices and a diagonal matrix of singular values) which, when multiplied back together, reconstruct the original matrix. SVD can also be used to compute the (pseudo-)inverse of a matrix.
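
As a tiny illustration of the factorisation with NumPy (not part of the LSA pipeline itself, just a sketch of the idea):

import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# multiplying the three factors back together reconstructs the original matrix
np.allclose(A, U @ np.diag(s) @ Vt)
>>> True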

The first stage involves creating a document-term matrix. This represents the documents in rows, the terms in columns and the TF-IDF scores in the cells. We then apply SVD to this matrix to derive the final list of topics.

A document-term matrix with documents in the rows and terms in the columns. Cells contain TF-IDF scores (image by author)

We can use the same TfidfVectorizer that we used before to compute the TF-IDF scores. We will limit the number of terms (features) to 800 to reduce the computational resources required.

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer(stop_words='english', max_features=800, max_df=0.5)
X = v.fit_transform(data_files)

If we look at the dimensions of the document-term matrix we can see the 4 documents in the rows and 800 terms in the columns.

X.shape
>>> (4, 800)

We now need to carry out the SVD which we can use the TruncatedSVD class for. This will do the heavy lifting for us.

from sklearn.decomposition import TruncatedSVD

We can specify the number of topics with the n_components parameter. In this case we set it to 4 assuming there is a different main topic per document (note: we can also use methods like topic coherence to determine the optimal number of topics k).

svd = TruncatedSVD(n_components=4)

Next we fit the model and get the feature names:

svd.fit(X)
doc_terms = v.get_feature_names_out()

Now we can output the terms associated with each topic. In this case the first 12 for each of the 4 topics.

for i, component in enumerate(svd.components_):
    terms_comp = zip(doc_terms, component)
    sorted_terms = sorted(terms_comp, key=lambda x:x[1], reverse=True)[:12]
    print("")
    print("Topic "+str(i+1)+": ", end="")
    for term in sorted_terms:
        print(term[0], " ", end="")
>>> Topic 1: rsquo href ldquo rdquo ventilators easter keytruda quebec unincorporated ford books inmates 
Topic 2: davos wef sibley denly stamler comox nortje caf pd rsquo href ldquo 
Topic 3: hopland geely easyjet davos vanderbilt wef asbestos macy jamaat sibley denly stamler 
Topic 4: rsquo href ldquo rdquo quebec div eacute src noopener rel mdash rsv

It’s then up to you to work out what these topics might represent based on the words they contain (topic labelling).

Another popular option for topic modelling is Latent Dirichlet Allocation (LDA), not to be confused with the other LDA (Linear Discriminant Analysis). LDA places Dirichlet priors on the document-topic and topic-word distributions and creates a semantic vector space model. The exact details of its implementation are beyond the scope of this introductory primer, but essentially it decomposes the document-term matrix into two matrices: one representing documents and topics, the other topics and terms. The algorithm then iteratively adjusts the topic assigned to each word in each document based on the probability that the topic generated the word in question.

We use the same idea as for the LSA earlier. First we tokenise the data with the CountVectorizer tokeniser. Then we create an instance of the LDA class, again setting the number of topics to 4 and outputting the top 12 terms.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
fitted = cv.fit_transform(data_files)
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=4, random_state=42)
lda.fit(fitted)
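
We can also inspect the topic mixture assigned to each document with the fitted model's transform method, which returns one row per document containing the topic proportions:

doc_topics = lda.transform(fitted)
doc_topics.shape
>>> (4, 4)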

We can then output the results in the same way as before:

doc_terms = cv.get_feature_names_out()
for i, component in enumerate(lda.components_):
    terms_comp = zip(doc_terms, component)
    sorted_terms = sorted(terms_comp, key=lambda x:x[1], reverse=True)[:12]
    print("")
    print("Topic "+str(i+1)+": ", end="")
    for term in sorted_terms:
        print(term[0], " ", end="")
>>> Topic 1: said  19  covid  people  coronavirus  new  health  also  would  one  pandemic  time  
Topic 2: 021  040  25000  421  4q  712  85th  885  accrues  accuser  acuity  afterthought  
Topic 3: said  coronavirus  people  health  19  new  covid  also  cases  virus  one  would  
Topic 4: 021  040  25000  421  4q  712  85th  885  accrues  accuser  acuity  afterthought

We may also want to go back and filter out the numbers if we don't think they are relevant. You can see this generates quite different results from the LSA we saw earlier. LSA is a good initial go-to for topic modelling, while LDA offers a different option if required.

Summary

This primer presented an overview of some of the common methods used in modern NLP tasks and how you can implement them in Python. Each of these methods has nuances that need to be considered based on the task at hand. There are also methods like POS (Part of Speech) tagging, also called grammatical tagging, that can provide additional information to algorithms. For example, you could tag each word with information on whether it is a noun, verb, adjective, adverb etc. This is often represented as a list of (word, tag) tuples, e.g. [('build', 'v'), ('walk', 'v'), ('mountain', 'n')] (a short example is shown below). There are also many ways words and terms can be represented for subsequent processing, such as word2vec and bag-of-words; the format you choose will again depend on the task and algorithm requirements. As with all machine learning and data science, the pre-processing of the data usually takes the longest and has a big impact on the outputs generated. After working through this primer you should hopefully be able to start generating some interesting insights from textual data and have an appreciation of some of the approaches you might take to analyse this type of data.
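
As a quick illustration of POS tagging with NLTK (a minimal sketch; the punkt and averaged_perceptron_tagger resources need to be downloaded first), pos_tag returns exactly this kind of (word, tag) list, using the Penn Treebank tag set rather than the simplified 'n'/'v' labels above (e.g. NN for noun, VBD for a past-tense verb), so the output will look something like this:

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.pos_tag(nltk.word_tokenize("the small boy waited in the house"))
>>> [('the', 'DT'), ('small', 'JJ'), ('boy', 'NN'), ('waited', 'VBD'), ('in', 'IN'), ('the', 'DT'), ('house', 'NN')]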

References

[1] Davies, Mark. (2019-) The Coronavirus Corpus. Available online at https://www.english-corpora.org/corona/.

