Natural Language Processing Notes

Continuing on from our Natural Language Processing Notes series, you may have noticed I skipped Week 2. This was not by accident: I have already made considerable notes on Bayes' Theorem and Naive Bayes (links below), and that algorithm (which we use to predict the sentiment of a tweet) is the only thing that changes from Week 1 to Week 2.
Marginal, Joint and Conditional Probabilities explained By Data Scientist
What is a Vector Space Model?
Vector space models are algebraic models that are often used to represent text (although they can represent any object) as a vector of identifiers. With these models, we are able to identify whether various texts are similar in meaning, regardless of whether they share the same words.

The idea is based on a famous saying by an English linguist (and a leading figure in British linguistics during the 1950s) named John Rupert Firth…
"You shall know a word by the company it keeps" – J.R.Firth
There are numerous applications for which we may decide to employ a vector space model, for instance:
- Information Filtering
- Information Retrieval
- Machine Translation
- Chatbots
And many more!
In general, vector space models allow us to represent words and documents as vectors.
Word By Word & Word By Doc
For us to represent our text as vectors we may decide to use a word-by-word or word-by-doc design. Performing this task involves first creating a co-occurrence matrix.
Although the two designs are performed in quite similar ways, we will discuss each one in turn. The objective in both cases is the same: to go from a co-occurrence matrix to a vector representation.

Word By Word: This design counts the number of times two words occur together within a certain distance k of one another.

In the word-by-word design, each word ends up with a vector of n entries, where n ranges from 1 up to the size of the vocabulary.
Word By Doc: This design counts the number of times words from the vocabulary appear in documents that belong to certain categories.
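To make the word-by-word design concrete, here is a minimal sketch in Python; the toy corpus and the window size k = 2 are assumptions for illustration, not values from the figures:

```python
from collections import defaultdict

# Hypothetical toy corpus (illustrative only)
corpus = [
    "I like simple data",
    "I prefer simple raw data",
]

k = 2  # co-occurrence window size (assumed for this sketch)

# Word-by-word co-occurrence counts as a nested dictionary:
# cooc[word][context_word] = number of times context_word appears
# within k positions of word.
cooc = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    tokens = sentence.lower().split()
    for i, word in enumerate(tokens):
        window = tokens[max(0, i - k): i] + tokens[i + 1: i + 1 + k]
        for context in window:
            cooc[word][context] += 1

# The row for a word is its vector representation,
# with up to |vocabulary| entries.
print(dict(cooc["data"]))
```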

Using these vector representations, we can now represent our text or documents in vector space. This is perfect because in vector space we can determine the relationships between types of documents, such as their similarity.

Euclidean Distance
One metric we may use to determine how far apart two vectors are from one another is the Euclidean distance, which is simply the length of the straight line that connects the two vectors.

Let’s use the formula in Figure 6 to determine which documents are more similar, based on our vector representations from Figure 5.

The results tell us that the Economy and Machine Learning documents are the most similar, since with distance-based metrics a lower value indicates greater similarity. With that being said, it is important to note that the Euclidean distance is not scale-invariant, so it is often recommended to scale your data.
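As a quick, minimal sketch of the calculation (the document vectors below are made up for illustration; they are not the values from Figure 5):

```python
import numpy as np

# Hypothetical word-count vectors for three documents
economy = np.array([9, 2])
machine_learning = np.array([8, 4])
food = np.array([1, 7])

def euclidean_distance(a, b):
    # Length of the straight line connecting the two vectors:
    # sqrt(sum((a_i - b_i)^2))
    return np.linalg.norm(a - b)

print(euclidean_distance(economy, machine_learning))  # smallest -> most similar
print(euclidean_distance(economy, food))
```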
Cosine Similarity
The problem with the Euclidean distance is that it is biased by differences in the size of the representations: two documents can look far apart simply because one is much longer than the other. We may therefore decide to use the cosine similarity, which determines how similar two texts are using the angle between their vectors.

Cosine similarity is one of the most popular similarity metrics used in NLP. To calculate it, we take the cosine of the angle between the two vectors.

When the cosine value is equal to 0, the two vectors are orthogonal to one another and have no match. A cosine value closer to 1 implies a greater match between the two vectors (since the angle between them is smaller). Therefore, from our results, Economy and Machine Learning are the most similar – read more about the cosine similarity metric on Wikipedia.
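A minimal sketch of the same comparison using cosine similarity, reusing the hypothetical vectors from the Euclidean distance example:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same hypothetical document vectors as before
economy = np.array([9, 2])
machine_learning = np.array([8, 4])
food = np.array([1, 7])

print(cosine_similarity(economy, machine_learning))  # closer to 1 -> more similar
print(cosine_similarity(economy, food))              # closer to 0 -> less similar
```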
Manipulating Words in Vector Space
By performing some simple vector arithmetic, we are able to infer unknown representations among words.
For instance, suppose we know the relationship between two similar words such as King and Man. To find the vector representation of the word "Queen", we first capture that relationship as a vector by subtracting the two vectors (i.e. King – Man). We then add this relationship vector to the vector representation of Woman, and the word whose vector is most similar to the result (which would be Queen in this instance) is the word we wanted to find.
Note: For more on this, read Mikolov et al., 2013, Distributed Representations of Words and Phrases and their Compositionality.
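A minimal sketch of this analogy arithmetic, using tiny made-up two-dimensional vectors purely for illustration (real word embeddings, such as those from Mikolov et al., have hundreds of dimensions):

```python
import numpy as np

# Made-up vectors purely for illustration
vectors = {
    "king":  np.array([1.0, 3.0]),
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([2.0, 1.0]),
    "queen": np.array([2.0, 3.0]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Capture the King -> Man relationship, then apply it to Woman
target = vectors["king"] - vectors["man"] + vectors["woman"]

# The known word whose vector is most similar to the result is our answer
best = max(
    (w for w in vectors if w not in ("king", "man", "woman")),
    key=lambda w: cosine_similarity(vectors[w], target),
)
print(best)  # queen
```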

Wrap Up
In conclusion, we may use vector space models to represent our text or documents in vector space, and when our data is in vector space, we can use the vectors to determine the relationships between text (or documents).
Let’s keep the conversation going on LinkedIn…