Introduction to PyTorch Model Compression Through Teacher-Student Knowledge Distillation

Model compression through Knowledge Distillation can bring savings in inference time, power efficiency, and model size.

Moussa Taifi PhD
Towards Data Science


Knowledge River Delta

Serving ML models in resource-constrained mobile and real-time systems can be a real problem. The ML community has been developing solutions to compress the large models that come out of big clusters of servers. Model compression promises savings in inference time, power efficiency, and model size. All of that can let that flying rescue drone cover more land on a single battery charge, and keep your mobile app from draining its users' batteries.

Knowledge distillation is a method used to reduce the size of a model without losing too much of its predictive power.

Geoffrey Hinton’s talk at the Deep Learning Summit 2018 about using Knowledge Distillation (KD) led me to look up the current state of the art for another class of problems: Recommender systems (RecSys).

This led me to the excellent Ranking Distillation (RD) work by Jiaxi Tang and Ke Wang, published at KDD 2018, which applies knowledge distillation to ranking tasks.

In this blog post I replicate a small part of the Ranking Distillation work on the Movielens 100K dataset. Working on it was a bit of a realization: even though KD is a solid conceptual framework for distilling knowledge from a large model into a smaller one, applying it to ranking tasks for recommender systems is not trivial.

The first challenge is that we are working at a lower level of abstraction than the usual fit/predict API found in higher-level libraries such as Scikit-learn and Keras. This is because the change needed to implement KD is in the loss function formulation itself. To tackle that, I followed in the footsteps of the RD paper and used the elegant PyTorch API to build this KD setup for RecSys.

The second challenge is that even though PyTorch is an elegant library, we need a higher-level framework that specializes in RecSys on top of PyTorch. The framework of choice these days seems to be Spotlight by Maciej Kula. I highly recommend it: the API is easy to use, and it lets the user customize most of the aspects we will need for this experiment.

Let’s get going!

Custom Ranking Distillation Backpropagation Flows

The goal is to train three models on the Movielens 100K dataset: a Student model, a Student model with distillation, and a Teacher model, and to compare their MAP@K metrics as well as their physical sizes on disk.

We need to explain the strategy we are going to use to transfer some of that Dark Knowledge from the Teacher model to the Student model with distillation. Here is what happens during training:

Flow of data and forward/backward propagation during the knowledge distillation method

In the above figure we show the training flow:

  • For the Student model, we use a traditional approach: training data with labels and a single ranking loss.
  • For the Teacher model, we pre-train it in the same way as the Student model, but with a larger network size to achieve a higher Mean Average Precision at K (MAP@K). After training the larger model, we store the pre-trained Teacher model.
  • For the Student model with distillation, we also use the training data with labels and the ranking loss. However, we additionally feed the same data through the Teacher model and use its predictions. More precisely, we use the teacher's loss in addition to the student's loss to calculate and backpropagate the gradients in the Student model's network. This extra information should improve the predictive power of the Student model with distillation while keeping the model size at the same level as the Student model without distillation.

Results Comparisons

First we need some training data, which we use to build a pre-trained Teacher model. We are using the Movielens 100K dataset, and we only use the user/movie interactions. We will try to predict the top 5 movies a user is most likely to rate.

For this we will use the ImplicitFactorizationModel that the Spotlight library provides. This model uses an embeddings-based model structure:

Implicit Factorization Model with a Bi-Linear model structure
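
Conceptually, the bilinear structure boils down to a dot product between a user embedding and an item embedding, plus per-user and per-item biases. Here is a minimal sketch of that idea; the function name and the random example batch are mine, for illustration only:

```python
import torch

def bilinear_score(user_emb, item_emb, user_bias, item_bias):
    # Predicted affinity = <user embedding, item embedding> + user bias + item bias
    return (user_emb * item_emb).sum(dim=1) + user_bias + item_bias

# Example: a batch of 4 user/item pairs with 200-dimensional embeddings.
users, items = torch.randn(4, 200), torch.randn(4, 200)
scores = bilinear_score(users, items, torch.zeros(4), torch.zeros(4))
```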

For the loss we use a formulation similar to the negative log-likelihood below. We sample both positive and negative pairs, and we ask the optimizer to increase the ranking of items from the positive pairs (d+) and decrease the ranking of items from the negative pairs (d-):

Loss function based on the negative log-likelihood.
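
In PyTorch terms, that loss can be sketched roughly as follows. This is my own simplified version, not the exact code from Spotlight or from the RD paper:

```python
import torch
import torch.nn.functional as F

def ranking_nll_loss(positive_predictions, negative_predictions):
    # -log sigmoid(d+): reward high scores for observed (positive) pairs.
    # -log sigmoid(-d-): penalize high scores for sampled negative pairs.
    positives_loss = -F.logsigmoid(positive_predictions)
    negatives_loss = -F.logsigmoid(-negative_predictions)
    return torch.cat([positives_loss, negatives_loss]).mean()

# Example with a random batch of 32 positive and 32 negative scores.
loss = ranking_nll_loss(torch.randn(32), torch.randn(32))
```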

Training a “large” teacher model, with 200 as the size of each embedding layer, on the Movielens dataset gives us our baseline metrics.
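
For reference, here is a minimal sketch of how the teacher can be trained with Spotlight. Only the embedding size of 200 comes from the experiment; the loss variant, epoch count, and the rest of the hyperparameters are placeholder assumptions:

```python
from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.cross_validation import random_train_test_split
from spotlight.factorization.implicit import ImplicitFactorizationModel

# Movielens 100K user/movie interactions, split into train and test sets.
dataset = get_movielens_dataset(variant='100K')
train, test = random_train_test_split(dataset)

# "Large" teacher: embedding size 200.
teacher = ImplicitFactorizationModel(embedding_dim=200,
                                     n_iter=10,
                                     loss='pointwise',
                                     use_cuda=False)
teacher.fit(train, verbose=True)

# The "small" student is the same model with embedding_dim=2.
```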

Let’s try the same with a much smaller student model, with 2 as the size of each embedding layer, and compare the two.
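
The two quantities compared below are MAP@5 and the serialized model size. Spotlight does not ship a MAP@K metric out of the box, so here is a hedged sketch of the kind of helpers one could use; the function names, the ranking logic, and the use of Spotlight's internal `_net` attribute are my own choices:

```python
import os
import numpy as np
import torch

def map_at_k(model, test, train, k=5):
    # Mean Average Precision at k over users with test interactions,
    # masking out items already seen during training.
    test_csr, train_csr = test.tocsr(), train.tocsr()
    average_precisions = []
    for user_id in range(test.num_users):
        targets = test_csr[user_id].indices
        if len(targets) == 0:
            continue
        scores = -model.predict(user_id)             # negate: argsort is ascending
        scores[train_csr[user_id].indices] = np.inf  # push seen items to the bottom
        top_k = np.argsort(scores)[:k]
        hits = np.in1d(top_k, targets)
        precisions = np.cumsum(hits) / (1.0 + np.arange(k))
        average_precisions.append((precisions * hits).sum() / min(len(targets), k))
    return np.mean(average_precisions)

def serialized_size_mb(model, path='model.pt'):
    # Save the underlying PyTorch network and report its size on disk.
    torch.save(model._net, path)
    return os.path.getsize(path) / 1e6

# Usage, assuming `teacher`, `train`, and `test` from the snippet above:
# print(map_at_k(teacher, test, train, k=5), serialized_size_mb(teacher, 'teacher.pt'))
```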

Two observations:

  • First, the size of the student model after serialization is much smaller (0.10 MB vs 6.34 MB). This is consistent with the size of the network, since the embeddings are 100x smaller.
  • Second, the MAP@5 of the student model is lower than the teacher model’s (0.050 vs 0.073). The smaller network gets surprisingly far on its own, but a gap remains. The challenge is: can we do better while keeping the model size the same?

This is what we try next. We train a third model which is the student model with a boost from the pre-trained Teacher model.

To do that we need to mix the losses we get from both models in the loss function. This is where PyTorch shines. All we have to do is define a modified loss function that sums up the student and teacher losses, and let gradient descent do its magic. At its core, if you are a bit familiar with the positive vs. negative terms of a log-sigmoid ranking loss: we pass the current batch of data through the teacher network, get candidate predictions, and use them to generate the teacher loss values. The final loss we use for the optimization is the sum of the three terms: positive, negative, and teacher. Below is a sketch of the combined loss function.
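
The version here is my own simplified reconstruction: I assume the frozen teacher's scores on the same positive and negative pairs are turned into confidence weights for the student's log-sigmoid terms. The exact formulation in the original code may differ:

```python
import torch
import torch.nn.functional as F

def distillation_ranking_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    # Standard student terms: raise scores of positives, lower scores of negatives.
    pos_loss = -F.logsigmoid(student_pos)
    neg_loss = -F.logsigmoid(-student_neg)

    # Teacher term (reconstruction): weight the student's terms by how confident
    # the frozen teacher is about the same user/item pairs.
    with torch.no_grad():
        w_pos = torch.sigmoid(teacher_pos)    # teacher's confidence in the positives
        w_neg = torch.sigmoid(-teacher_neg)   # teacher's confidence in the negatives
    teacher_loss = -w_pos * F.logsigmoid(student_pos) - w_neg * F.logsigmoid(-student_neg)

    # Final loss = positive + negative + teacher terms, as described above.
    return (pos_loss + neg_loss + teacher_loss).mean()
```

Plugging something like this into Spotlight means going one level below the fit/predict API, for example by writing a custom training loop that calls both networks on each batch.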

How did we do?

  • First, the distilled model’s MAP@5 is much closer to the teacher model’s, while using only 2 as the size of the embedding layers (0.070 vs 0.073).
  • Second, the size is still 0.10 MB, the same as the non-distilled student model.

Here is a table with all these values for comparison:

  Model                       Embedding size   MAP@5   Size on disk
  Teacher                     200              0.073   6.34 MB
  Student (no distillation)   2                0.050   0.10 MB
  Student with distillation   2                0.070   0.10 MB

What did we learn from this adventure?

  • Well, I was pleased to see how flexible PyTorch is; it made it possible to reproduce a small portion of a KDD 2018 paper.
  • Knowledge Distillation is really cool and works for recommender systems as well.
  • In general, any time there is an interaction between two or more AI models, I am very interested in the results.

I invite you to dig deeper into the KDD 2018 paper if you are interested in this type of cross-model interaction. We did not cover how to improve this setup by weighting the Teacher model’s loss, or by only considering the top-K recommendations of the Teacher model. Probably a future post.

Until then!

Thanks.

References:

Tang, Jiaxi, and Ke Wang. “Ranking Distillation: Learning Compact Ranking Models With High Performance for Recommender System.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018.

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531 (2015).

Maciej Kula, Spotlight, 2017 https://github.com/maciejkula/spotlight

PyTorch, 2018, https://pytorch.org/
