Word Embeddings vs TF-IDF: Answering COVID-19 Questions

Tirtha
Towards Data Science
Apr 4, 2020

A comparison of text similarity methods for answering COVID-19 questions.

Dataset: CORD-19

Questions we are interested in:

  • Data on potential risk factors
  • Smoking, pre-existing pulmonary disease
  • Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
  • Neonates and pregnant women
  • Socio-economic and behavioral factors, to understand the economic impact of the virus and whether there were differences across groups.
  • Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
  • The severity of the disease, including the risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
  • Susceptibility of populations
  • Public health mitigation measures that could be effective for control

Preliminaries

This post assumes a basic understanding of NLP, such as Bag of Words, and how text is represented as numbers.

Method 1: TF-IDF and Cosine Similarity

TF-IDF stands for Term Frequency — Inverse Document Frequency, a commonly-used method in information retrieval tasks [1]. We are going to use it to find sentences that are similar to our search questions.

To do that, we need to represent each sentence as a vector. TF-IDF creates these vectors by weighing the terms by their prevalence across the documents. If a term occurs in almost all the documents in the corpus (which means that the term is useless and can be ignored), the IDF part of the formula ensures it gets a very low weight in the sentence vector. If a term is important in identifying a sentence (which means it doesn’t appear in many other sentences), it will get a high weight.

If we use TF-IDF to vectorize the following corpus of three sentences, try to guess which terms would get high weights and which ones would be weighted less.

Sentence 1: The mean incubation period

Sentence 2: The mean risk period

Sentence 3: The risk of transmission

Since the word the is present in all sentences, it is fairly unimportant and would get a low weight, while the terms incubation and transmission would get very high weights, as they are important in identifying sentences 1 and 3 respectively.

Ideally, you would use bigrams to capture terms like “incubation_period,” but let’s stick with unigrams for the sake of simplicity.

TF-IDF Matrix of Unigrams

I didn’t calculate anything here, so I wrote high, medium, and low instead of actual numbers. I suggest you calculate the matrix by hand if you aren’t too familiar with TF-IDF, but if you are, keep reading.
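If you’d rather see actual numbers, here is a minimal sketch of the same toy corpus using scikit-learn’s TfidfVectorizer (not the exact code from the notebook; it assumes a recent scikit-learn, plus pandas purely for display):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

sentences = [
    "The mean incubation period",
    "The mean risk period",
    "The risk of transmission",
]

# Build the TF-IDF matrix for the toy corpus (unigrams, stop words kept on purpose)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)

# Show terms as columns; "the" appears in every sentence, so its smoothed IDF
# is the lowest, while "incubation" and "transmission" score highest
print(pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=["Sentence 1", "Sentence 2", "Sentence 3"],
).round(2))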

Now, we want to find the sentence that talks about “the incubation period.” If we vectorize this search query using the already-built TF-IDF matrix, we would get this vector for the search query:

Notice that this vector is quite similar to the vector for sentence 1. If we calculate the search query’s proximity to the sentence vectors, we would find that it is closest to Sentence 1. We’ll use Cosine Similarity to calculate this proximity, since it’s much better suited for text data than plain Euclidean Distance.
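In code, that query step is just a transform followed by a cosine similarity, reusing vectorizer, tfidf_matrix, and sentences from the sketch above (again, a sketch rather than the notebook’s exact code):

from sklearn.metrics.pairwise import cosine_similarity

# Vectorize the search query with the already-fitted vectorizer
query_vec = vectorizer.transform(["the incubation period"])

# Cosine similarity between the query and each sentence vector
similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()
best_match = similarities.argmax()

print(similarities)           # highest score at index 0
print(sentences[best_match])  # "The mean incubation period"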

You may have realized that a major drawback exists in our TF-IDF method: the same concept might be worded differently in different papers. Example: the terms infant and neonate refer to the same thing, but if we’re searching for “risk factors in infants,” we’ll miss sentences that refer to infants as neonates. Here’s where word embeddings come in…

Method 2: Word Embeddings and Word Mover’s Distance

TF-IDF vectors do not account for semantic similarities in language. The weight for neonate has no relationship with the weight for infant. Word embeddings try to capture these relationships by relying on an elementary idea: words that appear in the same context have similar meanings. We’re going to use a word embeddings framework called word2vec, which learns vector representations for words in a corpus [2]. Here’s an example:

Sentence 1: Merkel imposed a lockdown in Germany after the coronavirus outbreak.

Sentence 2: The chancellor imposed a lockdown in Germany after the COVID-19 outbreak.

We notice that the words Merkel and chancellor are used interchangeably; so are coronavirus and COVID-19. So it follows that the words in each pair should have some sort of relationship.

Instead of having vectors for each sentence, Word Embeddings capture these semantic relationships by creating vectors for each word. In vector space the vector for a word like coronavirus would lie in close proximity to the vectors for analogous words like COVID-19, and far away from unrelated words like coffee or paper.

Building a representation like this without labels or manual work seems inconceivable, but it becomes clearer once you understand the process. These vectors are built using a shallow (1 hidden layer) neural network that learns to predict a word from a word in its context, for example predicting lockdown given merkel. Here’s an oversimplified example using our sentences from earlier:

A neural network is trained to predict a word given its context word. When it tries to predict lockdown given merkel, the hidden units will get some weight for merkel.

x = merkel, y = lockdown

Similarly, when predicting lockdown from chancellor, the hidden units will get some weights corresponding to chancellor that will help predict lockdown.

x = chancellor, y = lockdown

When you think about it, since a certain combination of weights is required to activate the output cell for lockdown, whatever words are predictive of lockdown should get similar weights.

The same goes for coronavirus and covid-19. Both these terms should get similar weights when predicting outbreak, since outbreak appears in both their contexts.

The weights in the hidden layer of the trained network become the vectors for the words, and the number of hidden layer neurons becomes the number of dimensions.

In the images above, notice that there are two neurons in the hidden layer, therefore, the word vectors are two-dimensional.
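If you want to poke at the mechanics yourself, here is a toy sketch with gensim (assuming the gensim 4.x API). With only two sentences the learned vectors are essentially noise; the point is just to see the two hidden units turn into two-dimensional word vectors:

from gensim.models import Word2Vec

# Toy corpus: our two example sentences, tokenized and lowercased
corpus = [
    ["merkel", "imposed", "a", "lockdown", "in", "germany",
     "after", "the", "coronavirus", "outbreak"],
    ["the", "chancellor", "imposed", "a", "lockdown", "in", "germany",
     "after", "the", "covid-19", "outbreak"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=2,  # two hidden neurons -> two-dimensional word vectors
    sg=1,           # skip-gram: predict surrounding words from the input word
    window=5,
    min_count=1,    # keep every word, even the ones that appear only once
    epochs=100,
    seed=42,
)

print(model.wv["merkel"])      # a 2-dimensional vector
print(model.wv["chancellor"])  # don't expect meaningful geometry from two sentences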

Here’s a visual example with four sentences:

The model was able to learn semantic relationships: merkel, for example, ends up closer to chancellor than to unrelated words like paris.

In these examples, I’ve created 2-dimensional embeddings. In a realistic situation though, it is typical to have 100–300-dimensional vectors.

As you may have guessed, this process requires a colossal amount of training data to learn accurate embeddings, so there are pre-trained word embeddings available online that you can use out-of-the-box.

After training word2vec on our CORD-19 data, here are the embeddings and similar words for infant. They are 100-dimensional, but I’ve used dimensionality reduction to be able to visualize them.

Word embeddings from CORD-19 Data
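The training call itself is short. A sketch of what it might look like, assuming gensim 4.x and leaving the actual CORD-19 preprocessing to the linked notebook (the placeholder corpus below is hypothetical):

from gensim.models import Word2Vec

# Placeholder: in practice, this is every tokenized, lowercased sentence
# extracted from the CORD-19 papers
cord19_sentences = [
    ["the", "incubation", "period", "of", "the", "virus"],
    ["risk", "factors", "in", "neonates", "and", "pregnant", "women"],
    ["each", "infant", "was", "followed", "for", "respiratory", "symptoms"],
    # ... many more sentences
]

model = Word2Vec(
    sentences=cord19_sentences,
    vector_size=100,  # 100-dimensional embeddings, as described above
    window=5,
    min_count=1,      # raise this (e.g. to 5) on the real corpus
    workers=4,
)

# Nearest neighbours of "infant" in the embedding space
# (meaningful only when trained on the full corpus)
print(model.wv.most_similar("infant", topn=5))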

So now that we have vectors for each word, how do we measure the similarity between two sentences?

Word Mover’s Distance (WMD) accomplishes this task by locating the position of two sentences in word embedding space and transforming one into the other [3]. For calculating the similarity between sentences A and B, it takes each word in A and “drives” it to the nearest word in B, until A is transformed into B or vice versa. The WMD is the mean “distance traveled” by the words.

Sentence 1: ‘Merkel imposed lockdown’

Sentence 2: ‘Chancellor imposed lockdown’

Sentence 3: ‘Paris imposed lockdown’

merkel, from sentence 1, is driven 0.0727 “kilometers” to chancellor in sentence 2; imposed and lockdown travel 0 kilometers since they are already present in sentence 2. Therefore, the WMD is (0.0727 + 0 + 0) / 3 = 0.0242

WMD between sentence 1 and sentence 2

Here, merkel travels 0.6312 kilometers to paris so the WMD = (0.6312 + 0 + 0) / 3 = 0.2104

WMD between sentence 1 and sentence 3

So with the help of word embeddings, we were able to determine that sentences 1 and 2 are more similar than sentences 1 and 3. Without word embeddings, sentence 1 would be equally similar to 2 and 3.
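gensim exposes WMD directly on a set of word vectors. Here is a sketch using pre-trained GloVe vectors from gensim's downloader, since their vocabulary covers words like merkel and paris; depending on your gensim version, wmdistance also needs the optional pyemd or POT package installed:

import gensim.downloader as api

# Any embeddings whose vocabulary covers these words will do
wv = api.load("glove-wiki-gigaword-100")

sentence_1 = "merkel imposed lockdown".split()
sentence_2 = "chancellor imposed lockdown".split()
sentence_3 = "paris imposed lockdown".split()

# Word Mover's Distance: lower means more similar
d_12 = wv.wmdistance(sentence_1, sentence_2)
d_13 = wv.wmdistance(sentence_1, sentence_3)

print(d_12, d_13)  # expect d_12 < d_13: sentence 2 is the closer match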

Now that we’ve covered the necessary ground to understand these methods, let’s try these out and see if we’re able to retrieve texts that have some similarity to our search queries.
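The retrieval step itself boils down to ranking every candidate sentence against the query. A rough sketch of that loop, reusing vectorizer, tfidf_matrix, and sentences from the TF-IDF sketch and wv from the WMD sketch (the real pipeline, run over the CORD-19 papers, is in the linked notebook):

from sklearn.metrics.pairwise import cosine_similarity

def rank_tfidf(query, sentences, vectorizer, tfidf_matrix, top_k=3):
    # Higher cosine similarity = better match
    scores = cosine_similarity(vectorizer.transform([query]), tfidf_matrix).flatten()
    order = scores.argsort()[::-1][:top_k]
    return [(sentences[i], round(float(scores[i]), 3)) for i in order]

def rank_wmd(query, sentences, wv, top_k=3):
    # Lower Word Mover's Distance = better match
    query_tokens = query.lower().split()
    scores = [wv.wmdistance(query_tokens, s.lower().split()) for s in sentences]
    order = sorted(range(len(sentences)), key=lambda i: scores[i])[:top_k]
    return [(sentences[i], round(scores[i], 3)) for i in order]

query = "incubation period of the virus"
print(rank_tfidf(query, sentences, vectorizer, tfidf_matrix))
print(rank_wmd(query, sentences, wv))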

And there we have it! Both methods retrieved quite precise answers. The word embeddings method could be improved by using more training data or using pre-trained embeddings.

Parting Thoughts

We discussed two methods for calculating the similarity between two texts and used them to retrieve answers for questions related to COVID-19. Even though WMD isn’t really feasible at scale for this kind of retrieval task due to its high time complexity, it is still a powerful method and has enormous applicability in many domains. I can see it being quite effective at plagiarism detection. If a sentence has been copy-pasted and a few words replaced with synonyms from a thesaurus, word2vec and WMD could spot it quite easily, whereas TF-IDF would fall short. Another alternative to WMD could be to average the word vectors for a sentence with Smoothed Inverse Frequency and calculate Cosine Similarity. However, we often fail to appreciate the power of simple methods like TF-IDF, and as we have seen here, they should not be overlooked.
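For completeness, that last alternative looks roughly like the sketch below. It uses the smooth inverse frequency weight a / (a + p(w)), where p(w) is the word's relative frequency in the corpus and a is a small constant such as 1e-3; the common-component removal step from the original SIF paper is omitted, and wv is again a set of word vectors such as the GloVe ones loaded earlier:

import numpy as np
from collections import Counter

def sif_embedding(tokens, wv, word_probs, a=1e-3):
    # Weighted average of word vectors; rare words get weights close to 1,
    # very frequent words get weights close to 0
    vecs = [a / (a + word_probs.get(w, 0.0)) * wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Word frequencies estimated from a (placeholder) token stream of the corpus
all_tokens = "the mean incubation period the mean risk period the risk of transmission".split()
counts = Counter(all_tokens)
word_probs = {w: c / len(all_tokens) for w, c in counts.items()}

s1 = sif_embedding("the incubation period".split(), wv, word_probs)
s2 = sif_embedding("the mean incubation period".split(), wv, word_probs)
print(cosine(s1, s2))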

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions in this article should not be interpreted as professional advice.

The code for this project is available in the notebook here: https://github.com/tchanda90/covid19-textmining

You can play around with the search tool here: https://tchanda90.github.io/covid19-textmining/

[1] Ramos, J. (2003, December). Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning (Vol. 242, pp. 133–142).

[2] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[3] Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International conference on machine learning (pp. 957–966).
