Introduction
Word embeddings have become one of the most widely used tools in Artificial Intelligence, and one of the main drivers behind its impressive achievements in tasks that involve processing natural language, whether speech or text.
In this post, we will unveil the magic behind them: we will see what they are, why they have become a standard in the Natural Language Processing (NLP from now on) world, how they are built, and explore some of the most widely used word embedding algorithms.
Everything will be explained in a simple and intuitive manner, avoiding complex maths and trying to make the content of the post as accessible as possible.
It will be broken down into the following subsections:
- What are word embeddings?
- Why should we use word embeddings?
- How are word embeddings built?
- What are the most popular word embeddings?
Once you are ready, let’s start by seeing what word embeddings are.
1) What are word embeddings?
Computers break everything down to numbers, more specifically to bits (zeros and ones). What happens when a piece of software running on a computer (a Machine Learning algorithm, for example) has to process a word? Simple: that word needs to be given to the computer in the only form it can understand: numbers.
In NLP, the simplest way to do this is to create a vocabulary containing a huge number of words (say 100,000), and assign a number to each word in the vocabulary.
The first word in our vocabulary (‘apple‘, maybe) will be number 0. The second word (‘banana‘) will be number 1, and so on, up to number 99,998 for the next-to-last word (‘king‘) and 99,999 for the last word (‘queen‘).
We then represent every word as a vector of length 100,000 in which every entry is a zero, except for a single one at the index assigned to that word.

This is called one-hot encoding for words.
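To make this concrete, here is a minimal sketch in Python (using a toy four-word vocabulary instead of 100,000 words) of how one-hot vectors can be built:

```python
import numpy as np

# A toy vocabulary: each word gets an integer index.
vocab = ["apple", "banana", "king", "queen"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = np.zeros(vocab_size)
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("apple"))   # [1. 0. 0. 0.]
print(one_hot("banana"))  # [0. 1. 0. 0.]
```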
One-hot encodings have several issues related to efficiency and context, which we will see in just a moment.
Word embeddings are just another way of representing words with vectors, one that solves many of the issues of one-hot encodings by somehow capturing the context or high-level meaning of each word.
The main takeaway here is that word embeddings are vectors that represent words, built so that words with similar meanings get similar vectors.
2) Why should we use word embeddings?
Consider the previous example but with only three words in our vocabulary: ‘apple’, ‘banana’ and ‘king’. The one-hot encoded vector representations of these words would be the following.

If we then plotted these word vectors in a three-dimensional space, we would get a representation like the one shown in the following figure, where each axis corresponds to one of the dimensions and the icons mark where the tip of each word vector lands.

As we can see, the distance from any vector (the position of its icon) to every other one is the same: going from one vector to another always means changing two coordinates by one unit each, giving a Euclidean distance of √2 between every pair. This would stay true if we expanded the problem to 100,000 dimensions: the vectors get longer, but every pair of word vectors remains equally far apart.
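We can check this numerically; the Euclidean distance between any two distinct one-hot vectors is always √2, no matter which words they represent:

```python
import numpy as np

# One-hot vectors for a three-word vocabulary: 'apple', 'banana', 'king'.
apple  = np.array([1.0, 0.0, 0.0])
banana = np.array([0.0, 1.0, 0.0])
king   = np.array([0.0, 0.0, 1.0])

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart,
# so the distance tells us nothing about meaning.
print(np.linalg.norm(apple - banana))  # 1.4142...
print(np.linalg.norm(apple - king))    # 1.4142...
print(np.linalg.norm(banana - king))   # 1.4142...
```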
Ideally, we would want vectors for words that have similar meanings or represent similar items to be close together, and far away from those that have completely different meanings: we want apple to be close to banana but far away from king.
Also, one hot encodings are very inefficient. If you think about it, they are huge empty vectors with only one item having a value different than zero. They are very sparse, and can greatly slow down our calculations.
In conclusion: one-hot encodings don't take into account the context or meaning of words, all the word vectors are the same distance from each other, and they are highly inefficient.
Word embeddings solve these problems by representing each word in the vocabulary with a fairly small (say 150-, 300- or 500-dimensional) fixed-size vector, called an embedding, which is learned during training.
These vectors are built so that words that appear in similar contexts or have similar meanings end up close together, and they are not sparse like the vectors derived from one-hot encodings.
If we had a 2-dimensional word embedding representation of our previous four words (‘apple‘, ‘banana‘, ‘king‘ and ‘queen‘) and plotted it on a 2D grid, it would look something like the following figure.

As we can clearly see from the previous image, the word embeddings of ‘apple‘ and ‘banana‘ are closer to each other than they are to ‘king‘ and ‘queen‘, and vice versa: words with similar meanings end up close together when we use word embeddings.
This fact also allows us to do something very cool: we can do arithmetic with word embeddings, using the representations of words we know to move towards other words.
The following image shows that if we subtract the embedding of the word ‘royal‘ from the embedding of the word ‘king‘, we arrive somewhere near the embedding of the word ‘man‘. In the same manner, if we subtract the embedding of ‘royal’ from the embedding of ‘queen’, we arrive somewhere near the embedding of the word ‘woman‘. Cool, right?

Lastly, as we can see, word embedding vectors are usually much smaller (2 dimensions in our example, but more commonly 150, 200, 300 or 500) and are not sparse, making calculations with them much more efficient than with one-hot vectors.
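Putting these properties together (similar words close by, and arithmetic between vectors), here is a tiny sketch using made-up 2D vectors; the numbers are invented purely for illustration, real embeddings are learned from data:

```python
import numpy as np

# Hypothetical 2D embeddings, made up purely for illustration.
embeddings = {
    "apple":  np.array([0.05,  0.02]),
    "banana": np.array([0.06, -0.01]),
    "royal":  np.array([1.00,  0.00]),
    "man":    np.array([0.00,  0.60]),
    "woman":  np.array([0.00, -0.60]),
    "king":   np.array([1.00,  0.60]),
    "queen":  np.array([1.00, -0.60]),
}

def distance(a, b):
    return np.linalg.norm(embeddings[a] - embeddings[b])

# Words with similar meanings end up close together...
print(distance("apple", "banana"))  # small
print(distance("apple", "king"))    # much larger

# ...and we can do arithmetic with the vectors:
# king - royal lands near man, queen - royal lands near woman.
print(embeddings["king"] - embeddings["royal"])   # ~ [0.0, 0.6]  = man
print(embeddings["queen"] - embeddings["royal"])  # ~ [0.0, -0.6] = woman
```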
3) How are word embeddings built?
As you have probably guessed, like many elements in the Machine Learning ecosystem, word embeddings are built by learning. Learning from data.
There are many algorithms that can learn word embeddings, and we will see some of them in just a bit, but the general goal is to build a matrix E that translates the one-hot vector representing a word into a fixed-size vector: the embedding of that word.
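Multiplying a one-hot vector by E simply selects one row of E, which is that word's embedding. A quick sketch with toy numbers:

```python
import numpy as np

# A (made-up) embedding matrix E with one row per word in the vocabulary:
# apple, banana, king, queen.
E = np.array([[0.05,  0.02],
              [0.06, -0.01],
              [1.00,  0.60],
              [1.00, -0.60]])

# One-hot vector for 'king' (index 2 in this toy vocabulary).
one_hot_king = np.array([0.0, 0.0, 1.0, 0.0])

# Multiplying the one-hot vector by E selects the matching row,
# which is exactly the word's embedding.
print(one_hot_king @ E)  # [1.0, 0.6]
print(E[2])              # same vector: in practice we just index the row
```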
Let’s see a very high-level example of one way this could be done.
Consider the sentence ‘I love drinking apple smoothies‘. If we remove the word ‘apple‘, we are left with the incomplete sentence ‘I love drinking ___ smoothies‘. If I then gave you this incomplete sentence and asked you to guess the missing word, you would probably say words like ‘banana’, ‘strawberry‘ or ‘apple‘, which all have a similar meaning and usually appear in similar contexts.
One of the main ways to learn word embeddings follows a very similar process: by guessing missing words across a huge corpus of text, the algorithms learn similar embeddings for words that repeatedly appear in similar contexts.
The embedding matrix E (the matrix that translates a one-hot vector into a word embedding) is obtained by training something similar to a language model (a model that tries to predict missing words in a sentence) with an Artificial Neural Network: the entries of E are learned in the same way as the weights and biases of the network. A sketch of this idea follows below.
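Here is a minimal sketch of this "guess the missing word" idea, written with PyTorch (the corpus, window size and hyperparameters below are arbitrary choices for illustration, not the exact setup of any particular algorithm):

```python
import torch
import torch.nn as nn

# A toy corpus; real embeddings are trained on millions of sentences.
sentences = [
    "i love drinking apple smoothies",
    "i love drinking banana smoothies",
    "the king and the queen",
]
vocab = sorted({w for s in sentences for w in s.split()})
word_to_ix = {w: i for i, w in enumerate(vocab)}

# Build (context words, missing word) training pairs with a window of 1.
pairs = []
for s in sentences:
    words = s.split()
    for i in range(1, len(words) - 1):
        context = [word_to_ix[words[i - 1]], word_to_ix[words[i + 1]]]
        pairs.append((context, word_to_ix[words[i]]))

class CBOW(nn.Module):
    def __init__(self, vocab_size, dim=10):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, dim)  # this is the matrix E
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context):
        # Average the embeddings of the context words, then score every
        # word in the vocabulary as a candidate for the missing word.
        return self.out(self.embeddings(context).mean(dim=0))

model = CBOW(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    for context, target in pairs:
        logits = model(torch.tensor(context))
        loss = loss_fn(logits.unsqueeze(0), torch.tensor([target]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# After training, the rows of the embedding matrix are the word embeddings.
print(model.embeddings.weight[word_to_ix["apple"]])
```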
In practice, you can avoid training your own word embeddings, as there are publicly available embeddings trained on various corpora (like the Wikipedia or Twitter GloVe word embeddings), saving you time and effort.
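For example, the gensim library can download several sets of pre-trained embeddings for you (the dataset name below is one of gensim's bundled options; the available names can vary between versions):

```python
import gensim.downloader as api

# Download pre-trained GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-100")

print(glove["king"][:5])               # first 5 dimensions of the vector
print(glove.most_similar("apple", topn=3))
```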
To end, let's briefly look at some of the most popular word embedding algorithms.
4) What are the most popular word embeddings?
The two most widely used word embedding algorithms are Word2Vec and GloVe. Let's see how they work.
- Word2Vec: Word2Vec is a family of related models that produce word embeddings using shallow, two-layer artificial neural networks that either try to predict a word from its context (Continuous Bag of Words, or CBOW) or predict the context from a single word (the Skip-gram model). This is the process described in the previous section; a small usage sketch follows right below.
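A minimal usage sketch with the gensim library (assuming gensim 4.x, where the embedding size parameter is called `vector_size`; older versions call it `size`):

```python
from gensim.models import Word2Vec

# A toy corpus: gensim expects a list of tokenized sentences.
sentences = [
    ["i", "love", "drinking", "apple", "smoothies"],
    ["i", "love", "drinking", "banana", "smoothies"],
    ["the", "king", "and", "the", "queen"],
]

# sg=0 trains CBOW (predict a word from its context),
# sg=1 trains Skip-gram (predict the context from a word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["apple"][:5])           # the learned embedding (first 5 dims)
print(model.wv.most_similar("apple"))  # nearest words in embedding space
```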

- GloVe: Short for Global Vectors, the GloVe algorithm calculates word embeddings from a co-occurrence matrix between words. This matrix is built by reading through a huge corpus of sentences and creating a column and a row for every unique word it finds. For every word, it registers how many times that word appears near every other word within a specific window size, so it also captures a measure of how close together two words tend to be in a sentence. A rough sketch of the counting step follows below.
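A rough sketch of the counting step only (real GloVe also weights each count by the distance between the two words, and then fits the embeddings to these counts):

```python
from collections import defaultdict

# Toy corpus; GloVe is normally run over billions of tokens.
sentences = [
    "i love drinking apple smoothies",
    "i love drinking banana smoothies",
]
window_size = 2

# cooccurrence[w1][w2] counts how often w2 appears within `window_size`
# words of w1 across the corpus.
cooccurrence = defaultdict(lambda: defaultdict(float))
for sentence in sentences:
    words = sentence.split()
    for i, word in enumerate(words):
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                cooccurrence[word][words[j]] += 1.0

print(dict(cooccurrence["apple"]))  # {'love': 1.0, 'drinking': 1.0, 'smoothies': 1.0}
```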

Conclusion and additional resources
That is it! As always, I hope you enjoyed the post, and that I managed to help you understand what word embeddings are, how they work, and why they are so powerful.
Here you can find some additional resources in case you want to learn more about the topic:
- A Neural Probabilistic Language Model, by Bengio et al.
- Lecture on Learning word embeddings by Andrew Ng.
- Machine Learning Mastery post on Word Embeddings.
- Overview of Word Embeddings and their use on semantic models.
_If you liked this post then feel free to follow me on Twitter at @jaimezorno. Also, you can take a look at my other posts on Data Science and Machine Learning here. Have a good read!_
If you want to learn more about Machine Learning and Artificial Intelligence follow me on Medium, and stay tuned for my next posts! Also, you can check out this repository for more resources on Machine Learning and AI!
Lastly, check out my other posts on Deep Learning for NLP: