
Word Embeddings: Intuition behind the vector representation of words



Photo by Sincerely Media on Unsplash

In this article I would like to talk about how words are commonly represented in Natural Language Processing (NLP), and what drawbacks of the "classical" word-vector representations word embeddings alleviate. In the practical section of the article, I am going to train a simple embedding at the character level (a very similar approach can be used to train embeddings at the word level as well).

Early Methods: Sparse Vectors.

Some readers might be curious: why do we need to convert words into vectors at all? The main reason is that, if done properly, word-vector representations let us directly apply many well-known Machine Learning algorithms to the problem at hand, such as sentiment analysis, text classification or text generation.

One-Hot Encoding.

One of the first methods used to convert words into vectors was One-Hot Encoding. To describe it briefly: we use a vector whose size equals the size of our vocabulary, filled with zeros except for a single position where we have a 1.

Figure created by Oleg Borisov.

The position of the 1 denotes exactly which word we have. As we can see in the figure above, if the machine produces a vector with a 1 in the second position (0, 1, 0, 0, …), then it is referring to a cat.

This is a very simple approach to word vectorisation, but it has multiple drawbacks. First of all, the vector is very sparse, since it contains mostly zeros and only a single non-zero value (in a real-world scenario we would expect to deal with a vocabulary of 10,000 words or more).
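To make this concrete, here is a minimal sketch of one-hot encoding in plain Python (the toy vocabulary is my own illustration):

# One-hot encoding sketch: a toy vocabulary chosen purely for illustration.
vocab = ["dog", "cat", "fish", "science", "scientist"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))  # [0, 1, 0, 0, 0]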

Secondly, one-hot vectors do not encode any similarity or dissimilarity information about the words if we consider similarity metrics such as cosine similarity or Euclidean distance.

In the case of cosine similarity, we focus on the angle between the two vectors.

Figure taken from https://github.com/sagarmk/Cosine-similarity-from-scratch-on-webpages

When the similarity score is 1 (or close to it), the two vectors are similar; when it is 0, the two vectors are orthogonal (unrelated); when it is -1, the two vectors point in opposite directions. In the case of word-vector representations, a similarity score of -1 would mean that the words are related but have opposite meanings, for example "hot" and "cold".

As you can probably see, one-hot encoding does not let us make meaningful use of the cosine or Euclidean similarity measures, because all of the vectors are mutually orthogonal. Take, for example, the words "science" and "scientist", which could be represented by the vectors:

science   = [0, 0, 0, 1, 0, ...]
scientist = [0, 1, 0, 0, 0, ...]
cos(theta) = 0

Since the cosine similarity is 0, we conclude that the two words are unrelated, which arguably should not be the case, as the two words are very similar.
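As a quick sanity check, here is a minimal sketch that computes the cosine similarity cos(theta) = (a · b) / (|a| |b|) of the two one-hot vectors above, using only the standard library:

import math

science   = [0, 0, 0, 1, 0]
scientist = [0, 1, 0, 0, 0]

def cosine_similarity(a, b):
    """cos(theta) = dot(a, b) / (norm(a) * norm(b))"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(science, scientist))  # 0.0 -- one-hot vectors are orthogonal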

To address this issue, people came up with another method, which I will briefly describe below.

K-shingles.

The idea here is to split any word into shingles of size k (shingles are similar to N-grams, but on a character level). For example, consider the words "science" and "scientist" and shingles of size 4.

k = 4
Word: "science",
Shingles: ['scie', 'cien', 'ienc', 'ence']
Word: "scientist"
Shingles: ['scie', 'cien', 'ient', 'enti', 'ntis', 'tist']

Each of the shingles would also be encoded using a one-hot representation, and to obtain the word-vector representation we sum up all the shingle-vectors that the word consists of. In our example it could be:

Science_Vector =   [1, 0, 1, 0, 0 ,..., 0, 0, 0, 1, 0, 0, 1]
Scientist_Vector = [1, 0, 1, 1, 0 ,..., 0, 1, 0, 0, 1, 1, 0]

As we can see in this example, there is some shingle overlap between the two words "science" and "scientist"; in particular, the shingles 'scie' and 'cien' are common to both words. Because of that, the cosine similarity is no longer going to be zero every single time, which is a good thing!
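Below is a minimal sketch of how such shingles and their overlap could be computed (the helper function is my own illustration):

def shingles(word, k=4):
    """Return the set of character-level shingles (n-grams) of size k."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}

s1 = shingles("science")    # {'scie', 'cien', 'ienc', 'ence'}
s2 = shingles("scientist")  # {'scie', 'cien', 'ient', 'enti', 'ntis', 'tist'}

# The shared shingles are what makes the shingle-based cosine similarity non-zero.
print(s1 & s2)  # {'scie', 'cien'}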

Of course, this approach also has its pros and cons. In fact, many applications still use this method (with some smart modifications), as it allows us to analyse words even when some misspellings occur (as you can imagine, one-hot encoding would not be able to tackle this issue).

One of the downsides, however, is that some words may appear similar according to this metric while linguistically they have nothing in common: for example, the words "write" and "writhe" share the shingle 'writ' but have completely different meanings.

The other inconvenience, once again, is that the vectors we are going to work with are still very sparse. Surely, we have a few more ones in the vector, but unfortunately our vectors still consist mostly of zeros.

Because of that, researchers started to think of creating some kind of compact vector representations, which are also known as dense vectors.

Dense Vectors: Word Embeddings.

The main goal here is to create a word-vector representation that is no longer a couple of thousand dimensions in size, but much smaller, on the scale of around 300 dimensions. For dense representations we no longer restrict the vectors to ones and zeros; instead, each dimension is allowed to take any floating-point value.

Figure created by Oleg Borisov. Axes denote features that could have been learned by the embedding. As we can see, the Cat and Dog vectors can only lie in the Legs and Breath hyperplane, while a Human word-vector representation lies in the hyperplane of all 3 features.

The best thing is that, if trained properly, similar words will point in similar directions, so we can actually make use of the similarity measures. Apart from that, each of the axes will also be able to represent some abstract information, as presented in the image above. These dense representations are also referred to as word embeddings.

The main question remains, however: how can we obtain and train such a word embedding? In other words, given some text, how can we map it into a space so that we achieve a representation similar to the one above?

One approach we can use is the simple method of predicting the next token given the previous one. To perform this kind of task we can use a simple Neural Network architecture. Let's dig a little bit deeper into this task.

Figure created by Oleg Borisov

The general intuition is presented in the figure above. Given some tokenized text t1, t2, t3, …, we take token t1 and supply it to our embedding layer, which transforms the token into its vector representation v1. We then use this vector and a Feed Forward (FF) layer to predict what the next token should be. In this case we would like to get token t2, so we use the Backpropagation algorithm to update the weights of our system. Then, of course, we repeat the same training procedure for all the other tokens in our text.

Of course, t1 is supplied in one-hot encoded format, but at the output of the embedding layer our large sparse vector has been transformed into a much denser representation.

The main thing to note here is that after training we do not actually care about the FF layer at all; we can simply drop it and forget it. What matters most to us is the Embedding layer, which has been trained to convert our tokens into a lower-dimensional vector space!

This is of course an oversimplified example, but I hope it makes clear why this architecture is useful. Now let's move on to the practical part, where I will implement this architecture.

Implementation.

Photo by Joshua Sortino on Unsplash

In this implementation section, I am going to show how we can use the approach outlined above to create character embeddings in a 2-dimensional space. The reason for working with characters is that for word embeddings we would have to train on a much larger dataset, and it would be impossible to get any sensible results using 2-dimensional embedding vectors.

As the training text I will use the book Frankenstein, which I used in one of my previous stories (when I was talking about Language Modelling). The code and related material are available on my GitHub page.

As you know, we have 26 letters in the English alphabet, plus 10 digits and 1 space character; therefore, we will be working with 37 characters in total, which is going to be our vocabulary for this task (I have removed all punctuation for simplicity).
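For reference, here is a minimal sketch of how the text could be cleaned down to these 37 characters and encoded into integer indices (the file name is just a placeholder of mine):

import re
import torch

vocab = "abcdefghijklmnopqrstuvwxyz0123456789 "            # 37 characters; index = token id
char_to_index = {ch: i for i, ch in enumerate(vocab)}

with open("frankenstein.txt") as f:                         # placeholder file name
    text = re.sub(r"[^a-z0-9 ]", "", f.read().lower())      # drop everything outside the vocabulary

encoded_text = torch.tensor([char_to_index[ch] for ch in text], dtype=torch.long)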

Let's create our Neural Network; to do that I use PyTorch:
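The actual code is available in the GitHub repository mentioned above; the snippet below is only a minimal sketch of what such a model could look like, assuming the 37-character vocabulary and a 2-dimensional embedding described earlier (the class name, hidden size and other details are my own choices):

import torch.nn as nn

class CharEmbeddingModel(nn.Module):
    """Predict the next character from the current one via a 2-D embedding."""
    def __init__(self, vocab_size=37, embedding_dim=2, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.ff = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),  # logits over the next character
        )

    def forward(self, token_ids):
        return self.ff(self.embedding(token_ids))

model = CharEmbeddingModel()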

Before we even start training our neural network, let's check what our embedding currently represents; of course, the result is going to be random. By the way, the "*" symbol in the figures denotes the space character, so that we can easily see it on the image.

Random Embedding

Now, let’s train the model for 80 epochs.
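A minimal training loop for this setup could look roughly as follows (full-batch for brevity, reusing the model and encoded_text from the sketches above; the actual training code is in the repository):

import torch
import torch.nn as nn

# Predict the next character: inputs are all characters except the last,
# targets are the same sequence shifted by one position.
inputs, targets = encoded_text[:-1], encoded_text[1:]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(80):
    optimizer.zero_grad()
    logits = model(inputs)          # shape: (sequence_length, vocab_size)
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()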

And after that is done, we can visualise our character embeddings in the 2-D plane.
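For reference, here is a sketch of how such a plot could be produced with matplotlib, reusing the vocab string and the trained model from the sketches above (the space character is drawn as "*"):

import matplotlib.pyplot as plt

coords = model.embedding.weight.detach().numpy()  # (37, 2) matrix of character embeddings

plt.figure(figsize=(6, 6))
for char, (x, y) in zip(vocab, coords):
    plt.scatter(x, y, s=5)
    plt.annotate("*" if char == " " else char, (x, y))
plt.show()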

Trained Embedding

As we can see, the result is quite amazing: the numeric characters are clustered together, and so are the vowels (even though "o" ran off a bit further). It is interesting that the Embedding layer we have trained has managed to pick up some understanding of the relationships between characters in the text.

Final remarks.

Of course, this is a very simplified example, which gives you the gist of how embeddings are trained in real-world scenarios. Some implementations use a similar approach with a small improvement: the modification lies in supplying more tokens to the neural network, such as t(i-2), t(i-1), t(i+1), t(i+2), with the goal of predicting the token t(i). Such an architecture is called the Continuous Bag of Words (CBOW) model.

The other idea that could be used, and which is in fact even more time-efficient, is the Skip-gram model, which inverts the operation. Here we supply the token t(i) and want the model to output its neighbouring tokens t(i-2), t(i-1), t(i+1), t(i+2).
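To make the difference concrete, here is a minimal sketch of how (input, output) training pairs could be generated from a token sequence for the two models (the helper function is my own illustration, not the word2vec reference implementation):

def training_pairs(tokens, window=2, mode="cbow"):
    """CBOW maps context -> centre token; skip-gram maps centre token -> each context token."""
    for i in range(window, len(tokens) - window):
        centre = tokens[i]
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        if mode == "cbow":
            yield context, centre
        else:  # skip-gram
            for neighbour in context:
                yield centre, neighbour

print(list(training_pairs(["the", "cat", "sat", "on", "mat"], mode="cbow")))
# [(['the', 'cat', 'on', 'mat'], 'sat')]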

Image taken from Ling, Wang & Dyer, Chris & Black, Alan & Trancoso, Isabel. (2015). Two/Too Simple Adaptations of Word2Vec for Syntax Problems.

I am not going to go into much detail here, because on the basic conceptual level these models are very similar to the one we have discussed and implemented in this article; they simply bring more context into the game, which helps the embedding learn better representations.

Are word embeddings better than the sparse word-vector representations? Well, that depends on the application. In fact, dense word embeddings have no way of addressing misspellings, as a misspelled word will most likely be outside of our vocabulary.

In chatbot settings, people sometimes use abbreviations or informal spellings that might confuse the system, e.g. "lol", "lemme", "gimme", "searchin". Even differences between UK English and US English might cause confusion, e.g. "color" vs "colour".

In research, it has also been found that embedding vectors can almost magically capture relationships between words, such as "a King is to a Queen as a Man is to a Woman", and that, with respect to a Country-Capital direction, Spain relates to Madrid in a similar way as Italy relates to Rome.

Image taken from Eligijus Bujokas
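You can reproduce such analogies yourself with pretrained vectors; for instance, using gensim's downloader (assuming the gensim package and its pretrained "glove-wiki-gigaword-100" vectors are available):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained 100-dimensional GloVe word vectors

# king - man + woman should land near "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))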

While this is obviously fascinating, an important thing to look out for is that word embeddings might be biased! They might be biased on a gender, racial or some other basis. For example, if we start exploring the embedding space along the job direction, we might recover that "a businessman is to a man as a babysitter is to a woman", which is an incorrect and unethical thing to say. This issue arises from the data used to train the embedding: if the text was biased towards one gender, nationality or race, then the embedding may learn this bias as well.

Thus, depending on the application, you might want to avoid using word embeddings in order to avoid discriminating against some groups.


Thank you very much for reading this article, stay tuned for more interesting NLP topics that will be discussed in the future stories!

