
How to Visualize Text Embeddings with TensorBoard

Easily create compelling charts from your text data and examine the quality of your embeddings

Image by author.

Table of contents

  1. Introduction
  2. Data Modeling
     2.1 Import the dataset
     2.2 Split into train and validation sets
     2.3 Tokenizer
     2.4 Padding
     2.5 Create and fit the model
  3. Export and visualize the embeddings
     3.1 Configure TensorBoard
     3.2 Play with the embeddings
  4. References

1. Introduction

A word embedding is any method that converts words into numbers (typically vectors), and producing one is a foundational step of any Machine Learning (ML) workflow involving text data.

Regardless of the problem at hand (classification, clustering, …), leveraging an effective numeric representation of the input text is paramount to the success of the ML model.

But what makes a numeric representation of text effective? Essentially, we want to map each word to a number or vector that conveys information about the word’s meaning.

One way to appreciate this concept intuitively is through word analogies¹, i.e. relationships of the form: "word₁ is to word₂ as word₃ is to word₄". These allow us to answer questions such as "man is to king as woman is to ..?" by vector addition and subtraction, as shown in the following image:

The closest embedding to w(king) - w(man) + w(woman) is that of queen. Image from [1].

Now, let us imagine having a generic dataset where each text sample is associated with a class (or label). For example, we might be facing a sentiment analysis task (with labels such as: neutral, negative, positive, …) or an intent detection problem for a conversational agent. In this case, we would train a classification model and assess its performance with metrics such as accuracy, precision, recall, F1 score… But how could we easily obtain a graphic representation of our embedded text like the one in the picture above?

In this post, we cover all the steps required to obtain appealing, interactive visualizations of text embeddings, starting from a generic dataset imported from a file. To achieve this goal, we will use TensorFlow², a popular open-source ML library, and TensorBoard³.

2. Data Modeling

2.1 Import the dataset

In a previous post⁴, we described the creation of a dataset of readers’ reviews from Rome’s libraries, starting from the open data made publicly available by "Istituzione Biblioteche di Roma"⁵.

The dataset is composed of two columns: one with the readers’ comments in plain text (written in Italian) and one with the respective labels. The labels identify the topic of each review⁶. We want to create a multi-class text classifier able to predict the topic of an input review, and then observe the text embeddings.

We start by importing the needed libraries:
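A plausible set of imports covering the steps below (module names may differ slightly from the original notebook):

import os
import random

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorboard.plugins import projector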

We load the dataset from a file. Notably, the following steps can be applied to any dataset containing at least the input text samples and their respective labels, in any language:
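For instance, a minimal loading step could look as follows (the file name and the column names "text" and "label" are illustrative placeholders):

# Load the dataset; file and column names are placeholders for any
# dataset with a text column and a label column
df = pd.read_csv("reviews.csv")
df.head()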

Image by author.

The topic distribution is as follows⁴:

  1. Reviews about the condition of women in society, or novels with strong female protagonists (n=205, 25.5%)
  2. Reviews of albums and concerts, or biographies of musicians (n=182, 22.64%)
  3. Reviews of books and essays about economics and socio-political conditions (n=161, 20.02%)
  4. Reviews related to Japan or the Japanese culture (n=134, 16.67%)
  5. Reviews about scientific and technical divulgation essays (n=122, 15.17%)

Image by author.

For the purpose of classification, we need numeric labels. Therefore, we map the topic descriptions to integers:
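A simple way to do this (column names as assumed above):

# Map each topic description to an integer id and store it in a new column
label_map = {label: idx for idx, label in enumerate(df["label"].unique())}
df["label_id"] = df["label"].map(label_map)

num_classes = len(label_map)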

2.2 Split into train and validation sets

We split the data into train and validation sets as follows:
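For example, with a stratified 80/20 split (the proportion and random seed are arbitrary choices):

# Hold out 20% of the samples for validation, preserving the class balance
train_text, val_text, train_labels, val_labels = train_test_split(
    df["text"].values,
    df["label_id"].values,
    test_size=0.2,
    stratify=df["label_id"].values,
    random_state=42,
)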

2.3 Tokenizer

At this point, our training and validation sets are still made of text. In order to vectorize the text corpus by turning each review into a sequence of integers, we leverage the tf.keras.preprocessing.text.Tokenizer⁷ class.

We arbitrarily choose a vocabulary size, i.e. the maximum number of words to keep, then instantiate the tokenizer class and fit it on the training set. After that, we generate numeric sequences for both train and validation sets:
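A sketch of this step (the vocabulary size and the out-of-vocabulary token are arbitrary choices):

vocab_size = 5000  # maximum number of words to keep

# Fit the tokenizer on the training text only, then vectorize both sets
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(train_text)

train_sequences = tokenizer.texts_to_sequences(train_text)
val_sequences = tokenizer.texts_to_sequences(val_text)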

How does a tokenized text sequence look? We can observe a random processed sequence as follows:
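For instance:

# Pick a random training sample and compare the raw text with its integer sequence
idx = random.randrange(len(train_text))
print(train_text[idx])
print(train_sequences[idx])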

Image by author.

From the above image, we can see that each word of the input sentence is mapped to a distinct integer, producing a numeric sequence. Words that appear more than once always receive the same integer (e.g. "molto" => 39).

We finally create a dictionary containing the association between words and integers:
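The tokenizer exposes this mapping through its word_index attribute; in this sketch we keep only the entries within the chosen vocabulary size:

# Word-to-integer dictionary, truncated to the chosen vocabulary size
word_index = dict(list(tokenizer.word_index.items())[: vocab_size - 1])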

2.4 Padding

We have effectively converted text into sequences of integers, but we still cannot use them as input to a ML model, because each sentence may have a different length. We now have to make the sequences uniform (i.e. of the same size) without introducing undesired noise.

One way to obtain sequences of the same size is padding, i.e. adding a conventional value (typically zero) to the shorter sequences so that they match the desired length:

Padding example. Image by author.

One way to achieve this in TensorFlow is by using tf.keras.preprocessing.sequence.pad_sequences⁸:
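A minimal sketch, assuming the maximum sequence length below (the value is arbitrary):

max_length = 100  # maximum sequence length; shorter sequences are zero-padded

data_train = pad_sequences(train_sequences, maxlen=max_length)
data_val = pad_sequences(val_sequences, maxlen=max_length)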

Image by author.

Notably, pad_sequences accepts an input argument (padding) to specify whether the zeros should be added before (default behaviour) or after each sequence. We can print and inspect one padded sequence with data_train[0]:
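For example:

# Default behaviour pads before the sequence; padding="post" pads after it
pad_sequences([[5, 7, 9]], maxlen=6)                   # [[0, 0, 0, 5, 7, 9]]
pad_sequences([[5, 7, 9]], maxlen=6, padding="post")   # [[5, 7, 9, 0, 0, 0]]

print(data_train[0])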

Padded sequence. Image by author.

2.5 Create and fit the model

We define a simple model composed of an Embedding layer followed by Dense⁹ and Dropout¹⁰ layers. One may modify parameters and layers at will, but it is important to remember that:

  1. The last Dense layer must have an output space dimensionality equal to the number of classes we want to predict.
  2. As we are facing a multi-class classification problem, we leverage the softmax activation function. It estimates the discrete probability distribution over the target classes.
  3. The Embedding¹¹ layer should have an input dimensionality equal to the size of the vocabulary, i.e. maximum integer index +1:
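A minimal architecture consistent with these constraints (the layer sizes, the pooling layer, and the embedding dimension are illustrative choices, not necessarily those of the original notebook):

embedding_dim = 16  # size of the embedding vectors

model = tf.keras.Sequential([
    # Input dimensionality equals the vocabulary size (maximum integer index + 1)
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    # One output unit per class, with softmax for multi-class classification
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["categorical_accuracy"],
)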

We choose to train the model for 50 epochs, but we also use the EarlyStopping callback to monitor the validation loss during training: if the metric does not improve for at least 3 epochs (patience = 3), training is interrupted and the weights from the epoch with the best (i.e. lowest) validation loss are restored (restore_best_weights = True):
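A sketch of this step (it assumes the labels are one-hot encoded, consistently with the categorical cross-entropy loss above):

# One-hot encode the integer labels
y_train = tf.keras.utils.to_categorical(train_labels, num_classes)
y_val = tf.keras.utils.to_categorical(val_labels, num_classes)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

history = model.fit(
    data_train,
    y_train,
    epochs=50,
    validation_data=(data_val, y_val),
    callbacks=[early_stopping],
)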

We can observe the categorical accuracy from the History object returned by the fit method:
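For example, plotting it with matplotlib:

import matplotlib.pyplot as plt

# Training vs validation categorical accuracy per epoch
plt.plot(history.history["categorical_accuracy"], label="train")
plt.plot(history.history["val_categorical_accuracy"], label="validation")
plt.xlabel("Epoch")
plt.ylabel("Categorical accuracy")
plt.legend()
plt.show()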

Image by author.

Creating a multi-class text classifier is only instrumental to our final goal, i.e. providing a guide to extract text embeddings from a generic model and explore them visually. Therefore, we will not concentrate our efforts on further model improvement and performance assessment.

3. Export and visualize the embeddings

3.1 Configure TensorBoard

TensorBoard is "TensorFlow’s visualization toolkit"³. It is a tool that provides useful measurements and visualizations to monitor ML workflows. For example, one may employ TensorBoard to track metrics such as loss and accuracy, observe the model graph, explore weights and biases, and project embeddings to lower dimensional spaces.

To this aim, in the previous sections we imported the TensorBoard Embedding Projector¹³:

from tensorboard.plugins import projector

We load the TensorBoard notebook extension:

%load_ext tensorboard

We create a directory to store the needed information. Then, we save the word-to-integer dictionary and the weights from the Embedding layer inside the directory, and finally set up the projector.ProjectorConfig() configuration¹³:
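A sketch of this configuration, following the TensorBoard Embedding Projector tutorial¹³ (the directory and file names are illustrative):

log_dir = "/logs/fit/"
os.makedirs(log_dir, exist_ok=True)

# Save the vocabulary as metadata: one word per line, in index order
with open(os.path.join(log_dir, "metadata.tsv"), "w") as f:
    for word in word_index:
        f.write(f"{word}\n")

# Save the Embedding layer weights in a checkpoint the projector can read,
# keeping only the rows that correspond to words in the metadata
# (index 0 is reserved for padding and is skipped)
embedding_weights = model.layers[0].get_weights()[0]
weights = tf.Variable(embedding_weights[1 : len(word_index) + 1])
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

# Point the projector configuration at the saved tensor and its metadata
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = "metadata.tsv"
projector.visualize_embeddings(log_dir, config)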

We finally start TensorBoard by specifying the logging directory we just created through the logdir parameter:

%tensorboard --logdir /logs/fit/

3.2 Play with the embeddings

In order to visualize the embeddings, we select PROJECTOR from the dropdown menu on the top right of the TensorBoard dashboard:

Image by author.

On the bottom left, we notice that the interface provides a variety of dimensionality reduction techniques to choose from (UMAP, t-SNE, PCA, custom) in order to visually inspect the high-dimensional vectors in either two or three projected dimensions.

We can use the top-right menu to search for specific words and highlight their closest neighbours using either cosine similarity or Euclidean distance.

For example, the word "pianoforte" (piano) is close to the name "Bach", likely due to a prevalence of reviews of piano arrangements by the renowned German composer:

Image by author.

If we search for the word "Giapponese" (Japanese), the three closest words are, in order: "Miyazaki", "urbanistica" (urban planning) and "Banana", indicating possible preferences of the readers for the authors Hayao Miyazaki and Banana Yoshimoto, as well as an interest in Japanese urban planning concepts (e.g. machizukuri):

Image by author.

4. References

[1] Carl Allen, Timothy Hospedales, "Analogies Explained: Towards Understanding Word Embeddings", 2019, arXiv:1901.09813.

[2] tensorflow.org/

[3] tensorflow.org/tensorboard

[4] towardsdatascience.com/multi-label-text-classification-using-bert-and-tensorflow-d2e88d8f488d

[5] www.bibliotechediroma.it/it/open-data-commenti-lettori

[6] towardsdatascience.com/romes-libraries-readers-comments-analysis-with-deep-learning-989d72bb680c

[7] tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

[8] tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences

[9] tensorflow.org/api_docs/python/tf/keras/layers/Dense

[10] tensorflow.org/api_docs/python/tf/keras/layers/Dropout

[11] tensorflow.org/api_docs/python/tf/keras/layers/Embedding

[13] tensorflow.org/tensorboard/tensorboard_projector_plugin

