Deep learning with containers. Part 2

Easy experimenting and tracking with the deep learning stack deployed in containers.

Alexander Visheratin
Towards Data Science



In the previous post, we set up a multi-container application with Docker Compose, where one container hosts a Jupyter Notebook environment and another holds a Tensorboard deployment. The containers communicate through directories mounted as volumes from the host machine.

Today’s primary goal is to see how this setup works and how it enables easy experiment tracking. For this, we will start working towards a news recommender system that will (obviously) be backed by a neural network. As with any recommender system, it will propose the top 10 most similar news articles, and a neural network will generate the vectors that we will use to measure similarity. The key ingredients of what we are going to build:

  1. Dataset. We will use the Microsoft News Dataset (MIND) [1], which contains about 160,000 English news articles. Specifically, we will calculate the similarity between news titles and abstracts.
  2. Neural network. We will use the DistilBERT Transformer network as a backbone and add a couple of additional layers to generate embeddings of reasonable size.
  3. Training. We will employ siamese learning to train the network to differentiate similar and non-similar items in the embedding space.

All the code used for this post is available here.

Let’s walk through some theory and start experimenting.

Siamese learning

This concept has been around for a long time and has gained a lot of popularity in many fields, especially computer vision [2, 3, 4]. The main idea of siamese learning is to train a neural network to distinguish similar and non-similar samples. By similarity, we usually mean belonging to the same class. For this, the neural network generates N-dimensional embeddings from samples and uses the distances between them to calculate the loss. We can think of an N-dimensional embedding as the coordinates of a sample in N-dimensional space. The goal of training is to make sure that similar items are close in this space and non-similar ones are far from one another. We use specialized loss functions for siamese learning, the most widely used of which are contrastive loss [5] and triplet loss [6]. In our experiments, we will use the triplet loss function. The schema of siamese learning with triplet loss is shown in the figure below.

For siamese training with triplet loss, every training instance is a group of three samples called a triplet: an anchor, a positive (a sample similar to the anchor), and a negative (a non-similar one). We first pass these samples through the neural network to generate three embedding vectors. Then we calculate the Euclidean distance between the anchor and negative vectors (D_neg) and between the anchor and positive vectors (D_pos), and combine them with a pre-defined margin M:

loss = max(D_pos - D_neg + M, 0)

This means that a triplet affects the neural network’s weights (the loss is above zero) unless the anchor is already closer to the positive sample than to the negative one by at least the margin. A visual depiction of this effect is shown in the figure below.

A PyTorch implementation of the triplet loss that we will use in our experiments might look as follows (a minimal sketch assuming plain Euclidean distance; the exact version is in the linked notebook):
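```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=3.0):
    # Euclidean distances between the anchor and the positive/negative embeddings
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Hinge on the margin: the loss is zero once the positive is closer
    # than the negative by at least `margin`
    return torch.relu(d_pos - d_neg + margin).mean()
```

PyTorch also ships a ready-made torch.nn.TripletMarginLoss that implements the same formula, if you prefer not to write it by hand.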

Neural network

Our neural network will be based on the DistilBERT model [7], which is widely popular because it produces very good embeddings despite being relatively lightweight. Thanks to the great Hugging Face library, we can get a fully functional pre-trained network with only one line of code. The overall neural network that we will experiment with might look like the sketch below (the class name, head layout, and embedding size are illustrative; the exact classes are in the notebook):
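```python
import torch.nn as nn
from transformers import DistilBertModel

class NewsEncoder(nn.Module):
    """Hypothetical name; maps a tokenized title to an embedding vector."""

    def __init__(self, embedding_size=512):
        super().__init__()
        # One line of code gives us the fully functional pre-trained backbone
        self.backbone = DistilBertModel.from_pretrained("distilbert-base-uncased")
        hidden = self.backbone.config.dim  # 768 for distilbert-base
        # A couple of additional layers to bring the [CLS] vector
        # down to an embedding of reasonable size
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embedding_size),
        )

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # embedding of the [CLS] token
        return self.head(cls)
```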

Data preparation

To train the neural network, we need to generate the triplets. The MIND dataset contains a category and a subcategory for every article. We will use categories as the attribute for triplet generation because it would be much harder to train the network on the 200+ subcategories of the dataset.

The process of triplet generation is as follows:

  1. For every category, select a fixed number (category size, CS) of items. If we have N categories, we will have N*CS items as a result of this step.
  2. For every item (the anchor), randomly select one item from the same category (the positive) and one item from a different category (the negative). At this point, we have N*CS triplets.
  3. Repeat step 2 TI (triplets per item) times. Finally, we will have N*CS*TI triplets.
  4. Keep track of the generated triplets to prevent duplicates.

This algorithm does not include hard negative mining, but we hope that randomness will produce a good enough dataset. You can find the function for triplets generation here.
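For illustration, here is a simplified sketch of such a generator; the generate_triplets name and the items_by_category mapping (category name to a list of item ids) are hypothetical stand-ins for the actual MIND parsing code:

```python
import random

def generate_triplets(items_by_category, cs, ti, seed=42):
    rng = random.Random(seed)
    categories = list(items_by_category)
    triplets = set()  # a set keeps track of triplets and drops duplicates (step 4)
    for category in categories:
        # Step 1: a fixed number (CS) of anchor items per category
        anchors = rng.sample(items_by_category[category], cs)
        for anchor in anchors:
            # Steps 2-3: TI triplets per anchor
            for _ in range(ti):
                positive = anchor
                while positive == anchor:
                    positive = rng.choice(items_by_category[category])
                other = rng.choice([c for c in categories if c != category])
                negative = rng.choice(items_by_category[other])
                triplets.add((anchor, positive, negative))
    return list(triplets)
```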

Tracking

We will track two primary components of the training process for this series of experiments: loss values (training and validation) and parameters (M, CS, TI, number of epochs, etc.). The Tensorboard writer provides the add_scalar function for writing loss values, which can later be visualized as the plots we will see below. In the Jupyter notebook, you can find three types of losses reported to Tensorboard: training, validation, and running. The third one is not strictly necessary, but it helps to see the training dynamics and satisfies impatience while waiting. For storing and writing the parameters of an experiment, we use a dictionary where keys are parameter names and values are parameter values. After training, this dictionary is written to Tensorboard using the add_hparams function.
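A minimal sketch of this logging flow (the run name, tag names, and placeholder loss values are purely illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

# In our setup, the log directory is the volume shared with the Tensorboard container
writer = SummaryWriter("runs/siamese-m3-cs30-ti10")

params = {"M": 3, "CS": 30, "TI": 10, "epochs": 3}

for epoch in range(params["epochs"]):
    train_loss = 1.0 / (epoch + 1)  # placeholders for the real training loop
    val_loss = 1.2 / (epoch + 1)
    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("loss/validation", val_loss, epoch)

# Parameters and final metrics end up in the HPARAMS tab of Tensorboard
writer.add_hparams(params, {"hparam/val_loss": val_loss})
writer.close()
```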

Experiments

Let’s first check whether using just a pre-trained model would be enough to get good results.

DistilBERT vectors, no training

We will set a relatively low margin M=3 and run only the validation step, using the [CLS] token of the DistilBERT output as the embedding. As a result, we get a validation loss of 2.47, which is far from satisfactory: it is just a bit lower than M and way higher than zero. It means that we still need to train the network!
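Getting that baseline embedding is straightforward; a minimal sketch, assuming the standard distilbert-base-uncased checkpoint:

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()

batch = tokenizer(["Some news title"], padding=True, truncation=True,
                  return_tensors="pt")
with torch.no_grad():
    # The [CLS] token sits at the first position of the last hidden state
    embedding = model(**batch).last_hidden_state[:, 0]
```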

The three main parameters that affect the training process and the resulting quality are M (margin), CS (category size), and TI (triplets per item). We will start small with M=3, CS=30, and TI=10. This gives us 4,371 triplets that we split 4/1 into training and validation datasets. The results below show that the network quickly grasped what is going on and dropped both the training (left) and validation (right) losses to almost zero.

M=3, CS=30, TI=10

Now we are going to check if the network will do the same for a larger margin, M=5. After three epochs, the validation loss was higher than for M=3, so I added one more epoch, after which the loss started to grow for both training and validation. It seems there is not enough data for the network to learn.

M=5, CS=30, TI=10

Let’s change our parameters to CS=60 and TI=20, which gives us 17,060 triplets in total. This dataset allows the network to reach the same validation loss as for M=3 with only one additional epoch. But it is worth mentioning that the overall training process was six times longer.

Orange trace — M=3, CS=30, TI=10; magenta trace — M=5, CS=60, TI=20

At this point, you might ask: “Why don’t we use abstracts along with titles for training?” Great idea! For this, we need to make minor changes to the architecture of the network: pass both title and abstract tokens through the transformer part, then concatenate the [CLS] tokens and pass the resulting vector to the linear layers, as sketched below. From the results, we can see that after four epochs, the network overfits to the training set. We could address this by increasing the dataset further through CS and TI, but that would lead to an even longer training time to reach the same quality as the title-only case. So, for the sake of simplicity, we will abandon the idea of using abstracts for embeddings. I added the PyTorch Dataset and network classes to the notebook in case you are interested in experimenting with abstracts.
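A sketch of that modification, reusing the hypothetical names from the network sketch above:

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel

class NewsEncoderWithAbstract(nn.Module):
    def __init__(self, embedding_size=512):
        super().__init__()
        self.backbone = DistilBertModel.from_pretrained("distilbert-base-uncased")
        hidden = self.backbone.config.dim
        # The head now consumes the concatenation of two [CLS] vectors
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embedding_size),
        )

    def forward(self, title_ids, title_mask, abstract_ids, abstract_mask):
        title_cls = self.backbone(
            input_ids=title_ids, attention_mask=title_mask
        ).last_hidden_state[:, 0]
        abstract_cls = self.backbone(
            input_ids=abstract_ids, attention_mask=abstract_mask
        ).last_hidden_state[:, 0]
        return self.head(torch.cat([title_cls, abstract_cls], dim=1))
```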

Magenta trace — title only, M=5, CS=60, TI=20; gray trace — title+abstract, M=5, CS=60, TI=20

Now we will challenge our network with an even larger margin M=10. From the results, we can see that, as with the small dataset and M=5, we don’t have enough data for the network to distinguish classes with such a large margin — validation loss plateaus after the third epoch.

Magenta trace — M=5, CS=60, TI=20; blue trace — M=10, CS=60, TI=20

Since we already know the recipe for solving this problem, we increase CS to 100 and TI to 30. The dataset with 42,257 triplets helps to achieve the desired dynamics of the validation loss and its drop under 0.1 after the fourth iteration.

Magenta trace — M=5, CS=60, TI=20; blue trace — M=10, CS=100, TI=30

And for the final step of our experiments, we will check if the network can conquer a very large margin, M=30. For this, we obviously need to increase the dataset. With CS=200 and TI=50, we get quite a large dataset of 140,257 triplets. The results of the training are shown below. We can see that the network struggles to get a good “understanding” of the dataset, but after the fifth epoch, it manages to drop the validation loss to 0.13. For now, we will treat this as a good result and call it a day.

Blue trace — M=10, CS=100, TI=30; green trace — M=30, CS=200, TI=50

On a separate note, I want to mention the parameter tracking part of Tensorboard. Below you can see the table with all the parameters that were automatically logged from the notebook every time I ran the training. It was incredibly convenient for writing this post because, a week after running these experiments, I would otherwise have had no way of remembering any of them.

Conclusion

In the second part of the series about running deep learning in a containerized environment, we looked into how to track the model’s performance along with its parameters using Tensorboard. Since both Jupyter and Tensorboard run in Docker containers, you can be sure that libraries and their files are not scattered around your system. And at any point, you can completely free your hardware resources with a simple docker compose down command.

In the last part of this series, we will investigate how to prepare the trained model for the most efficient inference and create a full-fledged web application based on this model with less than 100 lines of code.
