
Word Embeddings and Embedding Projector of TensorFlow

Theoretical explanation and a practical example.

Photo by Ross Joyner on Unsplash

Word embedding is a technique for representing the words (i.e. tokens) in a vocabulary. It is considered one of the most useful and important concepts in natural language processing (NLP).

In this post, I will cover the idea of word embedding and why it is useful in NLP. Then, we will go through a practical example using TensorFlow's Embedding Projector to make the concept concrete.

Word embedding means representing a word as a vector in an n-dimensional vector space. Consider a vocabulary that contains 10,000 words. With plain integer encoding, the words are simply assigned the numbers 1 to 10,000. The downside of this approach is that the numbers capture no information about meaning; they are assigned without any consideration of how the words are used.

If we use word embedding with a dimension of 16, each word is represented by a 16-dimensional vector. The main advantage of word embedding is that words that share a similar context can be represented close to each other in the vector space, so the vectors carry a sense of a word's meaning. Let's assume we are doing sentiment analysis of customer reviews. If we use word embeddings to represent the words in the reviews, words with a positive meaning tend to point in a particular direction, while words with a negative meaning are likely to point in a different direction.
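To make "each word is a 16-dimensional vector" concrete, here is a minimal sketch of the lookup-table idea. The embedding matrix is filled with random values here purely for illustration (in a real model these values are learned), and the word-to-index mapping is made up:

```python
import numpy as np

vocab_size = 10_000   # number of words in the vocabulary
embedding_dim = 16    # dimension of each word vector

# In a real model these weights are learned during training;
# random values here only illustrate the shape of the lookup table.
embedding_matrix = np.random.uniform(-1, 1, size=(vocab_size, embedding_dim))

# Hypothetical integer ids assigned to three words by a tokenizer.
word_to_id = {"exciting": 42, "thrilling": 43, "boring": 917}

for word, idx in word_to_id.items():
    vector = embedding_matrix[idx]   # the 16-dimensional representation
    print(word, vector.shape)        # -> (16,)
```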

A very famous analogy that captures the idea of word embeddings is the king-queen example. It is based on the vector representations of the words "king", "queen", "man" and "woman". If we subtract the "man" vector from the "king" vector and then add the "woman" vector, we end up with a vector very close to "queen": king − man + woman ≈ queen.
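You can check this analogy with pre-trained vectors. The snippet below is a minimal sketch using the gensim library (not used elsewhere in this post) and one of its downloadable pre-trained models; the model name is an assumption and the download is on the order of 100 MB:

```python
import gensim.downloader as api

# Download pre-trained GloVe vectors (assumed model name; sizeable download).
model = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ queen
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically something like [('queen', 0.77)]
```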

There are different methods to measure the similarity of vectors. One of the most common is cosine similarity, which is the cosine of the angle between two vectors (their dot product divided by the product of their lengths). Unlike Euclidean distance, cosine similarity does not take the magnitude of the vectors into account when measuring similarity. Thus, cosine similarity focuses on the orientation of the vectors, not their length.

Consider the words "exciting", "boring", "thrilling", and "dull". In a 2-dimensional vector space, the vectors for these words might look like:

Word embedding in 2-dimensional space

As the angle between two vectors decreases, the cosine of the angle increases and thus the cosine similarity increases. If two vectors point in the same direction (the angle between them is 0°), the cosine similarity is 1. On the other hand, if two vectors point in opposite directions (the angle between them is 180°), the cosine similarity is -1.
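Here is a small sketch that computes cosine similarity for the four words above. The 2-dimensional coordinates are made up for illustration; they are not taken from any trained model:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between a and b:
    # dot product divided by the product of the vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 2-dimensional embeddings: positive words point one way,
# negative words point roughly the opposite way.
vectors = {
    "exciting":  np.array([0.9, 0.8]),
    "thrilling": np.array([0.8, 0.9]),
    "boring":    np.array([-0.8, -0.7]),
    "dull":      np.array([-0.9, -0.6]),
}

print(cosine_similarity(vectors["exciting"], vectors["thrilling"]))  # close to 1
print(cosine_similarity(vectors["exciting"], vectors["boring"]))     # close to -1
```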

When we use word embeddings, the model learns that "thrilling" and "exciting" are more likely to share the same context than "thrilling" and "boring". If we represented the words with integers, the model would have no idea of the context of these words.

There are different methods to create word embeddings, such as Word2Vec, GloVe, or the embedding layer of a neural network. Another advantage of word embeddings is that we can use pre-trained embeddings in our models. For instance, Word2Vec and GloVe embeddings are publicly available and can be used for natural language processing tasks. We can also choose to train our own embeddings using an embedding layer in a neural network. For example, we can add an Embedding layer to a Keras Sequential model, as sketched below. Note that training an embedding layer from scratch requires a lot of data to achieve good performance.
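As a sketch of the "train your own embeddings" option, the model below adds a Keras Embedding layer to a small sentiment classifier. The vocabulary size and embedding dimension match the numbers used earlier in this post; the pooling and dense layers are just one reasonable choice, not a prescribed architecture:

```python
import tensorflow as tf

vocab_size = 10_000   # words in the vocabulary
embedding_dim = 16    # dimension of each word vector

model = tf.keras.Sequential([
    # Learns a (vocab_size x embedding_dim) lookup table during training.
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    # Averages the word vectors of a review into a single vector.
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    # Single output for binary sentiment (positive / negative).
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```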

The example with four words above is a very simple case, but it conveys the idea and motivation behind word embeddings. To visualize and inspect more complicated examples, we can use TensorFlow's Embedding Projector.


Embedding Projector

The Embedding Projector is an amazing tool for understanding word embeddings. It lets you load and visualize your own embedding, and it also comes with several pre-trained embeddings to analyze.

I chose one of the pre-trained embeddings, Word2Vec 10K, but feel free to upload your own embedding using the Load option. TensorFlow has an informative tutorial on word embeddings that also explains how to load data into the Embedding Projector.
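Loading your own embedding boils down to uploading two TSV files: one with the vectors and one with the matching words (metadata). Below is a minimal sketch of that export; `weights` and `vocab` are assumed to come from a trained model (hypothetical names, not defined in this post), along the lines of the TensorFlow tutorial mentioned above:

```python
import io

# Assumptions: `weights` is a (vocab_size, embedding_dim) NumPy array taken
# from a trained embedding layer, and `vocab` is the list of words in the
# same order.
out_v = io.open("vectors.tsv", "w", encoding="utf-8")
out_m = io.open("metadata.tsv", "w", encoding="utf-8")

for index, word in enumerate(vocab):
    if index == 0:
        continue  # index 0 is often reserved for padding; skip it
    vector = weights[index]
    out_v.write("\t".join(str(x) for x in vector) + "\n")
    out_m.write(word + "\n")

out_v.close()
out_m.close()

# vectors.tsv and metadata.tsv can then be uploaded via the projector's
# "Load" button at https://projector.tensorflow.org
```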

Let’s go back to the Word2Vec 10K example, which includes 10,000 points in 200 dimensions.

The static image does not tell you much, but you can rotate and zoom the visualization to explore different words. Each dot in the image above represents a word. If you click on a dot, the projector shows the word and its nearest neighbors. For example, I clicked on the dot that represents the word "own":

The projector shows how many times "own" appears in the text, along with a list of the nearest points. The nearest point to "own" is "their", which makes sense because these words are likely to occur together, as in "their own".

After we select a particular word, we can isolate a certain number of associated words.

I selected the word "grandfather" and isolated 101 words. This gives a less dense view, which makes it easier to analyze the embedding visually.


Word embeddings are considered one of the most important ideas in natural language processing; they have allowed NLP researchers and practitioners to take a big step toward accomplishing complicated tasks.

Thank you for reading. Please let me know if you have any feedback.

