Why Bother Deploying a Huge Neural Model When a Small One Is Enough?

Model Compression Techniques that Perform Better than the Original One

Ida Novindasari
Towards Data Science


Photo by Simon Migaj on Unsplash

A deep neural network is one of the most powerful machine learning methods. It achieves outstanding performance in many tasks, including visual recognition, natural language processing, and speech recognition. Building a deep neural network that performs well on a downstream task often requires millions or even billions of parameters. One example is Krizhevsky's model, which won the ImageNet image classification competition: it has more than 60 million parameters with only five convolutional layers and three fully connected layers [1].

Deploying a model with that many parameters requires a lot of resources and is computationally expensive. Target devices often have limited resources, and the computation becomes a burden, especially for real-time applications or applications with an online learning algorithm. So, how can we reduce the model size while achieving the same performance as the original, bigger model? Can we somehow transfer the knowledge from the larger models to the smaller ones?

In this post, we will go through model compression techniques that transfer knowledge from a huge neural model to a smaller one without significantly decreasing performance. Compressing a neural model can be done with parameter sharing and pruning or with knowledge distillation. We will look at the differences, the advantages, and the drawbacks of each.

Parameter Sharing and Pruning

The idea of parameter pruning and sharing is to exploit the redundancy in the model parameters. By examining the original model's parameters, we hope to remove the redundant, insensitive, or unimportant ones while still preserving the original model's performance. ResNet-50, with its fifty convolutional layers, needs over 3.8 billion floating-point multiplications to process a single image; after discarding redundant weights, it can save more than 75% of its parameters and 50% of its computation time [1]. Three techniques for parameter pruning and sharing are quantization and binarization, pruning and sharing, and the structural matrix; a minimal code sketch of these ideas follows the list.

  • Quantization and binarization. The idea of this technique is to reduce the number of bits used to store each weight. The extreme case is binarization (constraining each weight to a single bit), as in BinaryConnect, BinaryNet, etc. The main disadvantage is that binarization significantly lowers the accuracy when applied to large CNNs such as GoogLeNet.
  • Parameter pruning and sharing. Just like the name suggests, this technique tries to remove redundant and non-informative weights from a pre-trained CNN, although applying it to fully-connected layers can still be cumbersome in terms of memory consumption. The weights can be removed by reducing the total number of parameters and operations in the entire network, or shared by using a hash function to group them into buckets that reuse the same parameter value.
  • Structural matrix. For fully-connected layers, we can replace the dense weight matrix with a structured matrix that is described by far fewer parameters. This reduces the memory cost and also speeds up both inference and training. However, the structural constraint may introduce bias into the model and thus hurt performance, and finding a proper structural matrix is difficult since there is no theoretical way to derive one.
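To make these ideas concrete, here is a minimal NumPy sketch on a toy weight matrix: symmetric 8-bit quantization, magnitude-based pruning, and a circulant (structural) matrix applied via the FFT. This is just an illustration under my own simplifying assumptions, not the implementations surveyed in [1]; real systems add calibration, fine-tuning, and sparse or fixed-point storage on top.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)        # pretend layer weights

# --- Quantization: store each weight in 8 bits instead of 32 ---
scale = np.abs(W).max() / 127.0                            # symmetric linear quantization
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_deq = W_q.astype(np.float32) * scale                     # values used at inference time
print("max quantization error:", np.abs(W - W_deq).max())

# --- Pruning: drop the 75% of weights with the smallest magnitude ---
threshold = np.quantile(np.abs(W), 0.75)
mask = np.abs(W) >= threshold
print("remaining weights:", mask.mean())                   # ~0.25; W * mask is the pruned layer

# --- Structural matrix: a circulant layer stores n parameters instead of n*n ---
c = rng.normal(size=256).astype(np.float32)                # first column defines the whole matrix
x = rng.normal(size=256).astype(np.float32)                # a toy input vector
y = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))    # circulant matrix-vector product in O(n log n)
```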

Since all of these techniques remove some of the model parameters, it is hard to keep the performance from decreasing.

Knowledge Distillation

Different from parameter pruning and sharing, which tries to reduce the number of weights, knowledge distillation tries to reproduce the outputs of the cumbersome model (the teacher) in a more compact model (the student), so that the student can generalize in the same way as the teacher. The relative probabilities the model assigns to incorrect answers tell us how it tends to generalize. For example, an image of a car may have only a tiny probability of being mistaken for a bus, but that probability is still much larger than the probability of it being mistaken for food.

Since these probabilities capture how the teacher model generalizes, we can use them as the target labels for training the student model and thereby transfer the teacher's generalization ability. These probabilities are called soft targets.

“When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.” [2]
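The way [2] obtains such high-entropy soft targets is by raising the temperature of the softmax: dividing the logits by a temperature T > 1 flattens the distribution, so the small probabilities of the incorrect classes (the car/bus/food relationships above) become visible. A minimal NumPy sketch, with logit values made up purely for illustration:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; larger T gives a softer, higher-entropy distribution."""
    z = logits / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([9.0, 4.0, 1.0])   # toy logits for: car, bus, food
print(softmax(teacher_logits, T=1.0))        # nearly one-hot: almost all mass on "car"
print(softmax(teacher_logits, T=4.0))        # soft targets: "bus" now clearly more likely than "food"
```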

To train the student model, we can use the same training set that was used for the teacher or a separate one. Figure 1 shows how knowledge distillation works. During training, we update the student model using the soft loss, i.e. the loss between the student's predictions and the teacher's soft targets. However, also giving the student information about the hard labels can significantly improve its performance. So, instead of updating the student with the soft loss alone, we also use the hard loss: the cross-entropy between the student's prediction and the ground-truth label. The student is then trained on a weighted average of the soft loss and the hard loss (sketched in code after Figure 1).

Figure 1. Illustration of transferring knowledge using soft targets from the teacher model [image by author]
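A minimal PyTorch-style sketch of this weighted soft/hard loss might look as follows. The function name and the values of the temperature T and the weight alpha are my own assumptions for illustration; in practice they are hyperparameters to tune.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted average of the soft loss (vs. the teacher) and the hard loss (vs. the labels)."""
    # Soft loss: KL divergence between the softened teacher and student distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale so the soft-loss gradients keep their magnitude [2]
    # Hard loss: ordinary cross-entropy with the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Inside a training loop (the teacher is frozen and only used for its predictions):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
# loss.backward(); optimizer.step()
```

The T*T factor follows [2]: the gradients produced by the soft targets scale as 1/T², so multiplying by T² keeps the relative contribution of the soft and hard losses roughly constant when the temperature changes.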

On the MNIST dataset, the distilled model performs well even when the training data used for the student is missing some of the classes. The method also works well on huge datasets. For image classification on JFT, an internal Google dataset with 15,000 classes and 100 million labelled images, distillation makes training faster: the authors split the problem across 61 specialist models (smaller models trained on subsets of the data, around 300 classes each). Instead of taking many weeks, training these smaller specialist models only takes a few days. On top of that, the specialist models improved accuracy by 4.4% compared to the cumbersome model [2].

Seeing these results of knowledge distillation might trigger the question: will this method perform well on every dataset? By studying the particular case of shallow linear and deep linear classifiers, [3] identifies three critical factors for the success of knowledge distillation:

  • Data geometry. The first factor is the geometric properties of the data distribution used to train the model, in particular the class separation. Class separation has a direct influence on how fast the student's risk converges: the better separated the classes, the faster the risk decays (a polynomial decay of higher degree), and the lower the risk the student can achieve.
  • Optimization bias. Not only in knowledge distillation, the bias of the optimizer in general also affects the speed of convergence during training. In this case, gradient descent turns out to be very favourable for minimizing the student objective.
  • Strong monotonicity. Just as neural network training usually improves when we increase the amount of training data, the same applies here: knowledge distillation works when adding more training data monotonically decreases the risk of the student model.

Based on these experiments, the authors expect that similar properties also hold for non-linear classifiers. Despite its success, knowledge distillation also has some drawbacks. One is that the technique can only be applied to classification tasks with a softmax output, which limits its usage. Another is that the model assumptions are sometimes too strict to make the performance competitive with other approaches.

References:

[1] Cheng, Y., Wang, D., Zhou, P. and Zhang, T., 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
[2] Hinton, G., Vinyals, O. and Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[3] Phuong, M. and Lampert, C., 2019, May. Towards understanding knowledge distillation. In International Conference on Machine Learning (pp. 5142–5151).
