This article gives an overview of our recent WACV 2021 paper on neural network compression.
The need to compress networks for tasks with sequential data, such as action recognition from videos:
Recurrent Neural Networks (RNNs) and their advanced variant, the Long Short-Term Memory (LSTM) network, are specialized for processing sequential data such as text, spoken words, and videos. However, these networks have a large number of parameters and incur significant inference time. The number of hidden states is a hyper-parameter, and the usual choices (256, 512, or 1024) are often much larger than required for accurate prediction, leading to over-parameterization. In action recognition from videos, each input frame consists of stacked RGB color channels and forms a high-dimensional input. For RNNs, this makes the input-to-hidden matrix extremely large. For example, a video from the UCF11 dataset has RGB frames of 160×120 pixels each, so the flattened input size is 160×120×3 = 57,600. Even with a relatively small hidden state size of 256, a single-layer LSTM requires about 58.9 million parameters for its input-to-hidden matrices alone. This naïve, over-parameterized one-layer end-to-end LSTM model overfits, reaching only 67.7% accuracy on the UCF11 dataset [4].
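As a sanity check, here is a short back-of-the-envelope computation of these numbers in Python (a minimal sketch; the 58.9M figure corresponds to the input-to-hidden weights, which dominate the parameter count):

```python
# Rough parameter count of a single-layer LSTM on flattened UCF11 frames.
# An LSTM has four gates (input, forget, output, cell candidate), each with
# an input-to-hidden and a hidden-to-hidden weight matrix, plus biases.

input_size = 160 * 120 * 3          # flattened RGB frame -> 57,600
hidden_size = 256

input_to_hidden = 4 * input_size * hidden_size    # 58,982,400 (~58.9M)
hidden_to_hidden = 4 * hidden_size * hidden_size  # 262,144
biases = 4 * hidden_size                          # 1,024

print(f"input-to-hidden weights: {input_to_hidden:,}")
print(f"total LSTM parameters:   {input_to_hidden + hidden_to_hidden + biases:,}")
```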
The idea of ‘relevance’ with the Variational Information Bottleneck:
Several tensor decomposition methods [3,4,5] have been applied to RNNs, replacing the standard input-to-hidden matrix with a low-rank structure. These methods reshape the input and model the input-to-hidden matrix as dense weight matrices of lower rank. However, most methods for compressing RNNs do not compress the hidden-to-hidden matrix. A simple action recognition dataset such as UCF11 has fewer classes and less variation than larger datasets like UCF101, so only a few hidden states, those relevant to correctly predicting the action, are needed to model the data representations. The Information Bottleneck principle introduced by Tishby et al. [1], and its variational formulation, the Variational Information Bottleneck (VIB) [2], formalize this idea of retaining only the prediction-relevant intermediate representations in a neural network while preserving prediction accuracy.
![Figure 1: Tishby et al. [1] present the information bottleneck principle for obtaining the most concise yet prediction-relevant representation based on information-theoretic measures.](https://towardsdatascience.com/wp-content/uploads/2021/07/1v-INnALzu7tbShsdTxN9Uw.png)
We adapt this idea to the LSTM network, a complex variant of the RNN, to remove redundant input features and hidden states and thereby reduce the total number of model parameters.
How to retain relevant hidden states and input features to LSTMs with VIB?
Motivated by the idea of removing redundant neurons with VIB, we make the following contributions towards compressing sequential networks:
(a) We propose a novel VIB-LSTM structure, shown in Figure 2(a), that trains highly accurate sparse LSTM models.
(b) We develop a sequential network compression pipeline that sparsifies pre-trained model matrices of RNNs/LSTMs/GRUs.
(c) For architectures combining CNNs and LSTMs, our VIB framework retains only the prediction-relevant features, which are then given as input to the VIB-LSTM structure.
(d) We evaluate our method on popular action recognition datasets (UCF11, UCF101, and HMDB51), yielding compact models with validation accuracy comparable to that of state-of-the-art models.

Our goal is to learn compressed gate representations i^t, f^t, o^t, and g^t while retaining the relevant spatial and temporal information in v^t required for prediction. In the Variational Information Bottleneck (VIB) framework [1, 2], this is cast as an optimization problem: learn k̃^T so that it carries the least possible information about the LSTM input v^t while retaining all the information relevant for predicting the target Y. Note that compressing k̃^T is equivalent to compressing h^T, since there is a deterministic mapping between them. Mathematically, this amounts to optimizing the following objective function:
min_θ β · I(v^t, k̃^T; θ) − I(k̃^T, Y; θ)
where I(·, ·; θ) denotes the mutual information between two random variables, θ is the parameter set of a compression network that transforms v^t into k̃^T, and β is a hyperparameter that controls the trade-off between compression and prediction accuracy. Since this objective is intractable in general, owing to the model complexity and the infeasibility of computing the mutual information terms, a variational upper bound is invoked [2]. We apply the VIB theory as a layer on the LSTM gate outputs, which gives rise to the equations in Figure 2(b). This retains only the relevant hidden states, reducing the hidden-state dimension of the LSTM's hidden-to-hidden matrix. Similarly, for end-to-end and CNN-LSTM architectures, we introduce a VIB layer on the LSTM input to retain only the prediction-relevant feature vectors, reducing the input dimension of the LSTM's input-to-hidden matrix. Our VIB-based approach therefore sparsifies all the LSTM matrices, unlike previous tensor decomposition methods for action recognition [3,4,5].
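To make the mechanism concrete, below is a minimal PyTorch sketch of a VIB-style compression layer of the kind described above. It is an illustration under our own parameterization, not the paper's exact implementation: each dimension is multiplied by a learned stochastic gate, a KL-style penalty upper-bounds the information the gated output keeps about its input, and dimensions whose gates collapse to noise can be pruned.

```python
import torch
import torch.nn as nn

class VIBLayer(nn.Module):
    """Minimal sketch of a VIB compression layer (illustrative only).

    Each dimension of the incoming vector is multiplied by a stochastic
    gate z = mu + sigma * eps. Minimizing the information penalty drives
    uninformative dimensions toward pure noise, which can then be pruned.
    """

    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.ones(dim))                    # gate means
        self.log_sigma2 = nn.Parameter(torch.full((dim,), -9.0))   # gate log-variances

    def forward(self, x):
        if self.training:
            eps = torch.randn_like(x)
            gate = self.mu + torch.exp(0.5 * self.log_sigma2) * eps
        else:
            gate = self.mu                                         # deterministic at test time
        return x * gate

    def info_penalty(self):
        # Variational upper bound on I(input; gated output): it grows with
        # the signal-to-noise ratio mu^2 / sigma^2 of each gate.
        snr = self.mu ** 2 / torch.exp(self.log_sigma2)
        return 0.5 * torch.log1p(snr).sum()

    def prune_mask(self, threshold=1e-2):
        # Dimensions whose gates carry (almost) no signal can be removed,
        # shrinking the corresponding rows/columns of the LSTM matrices.
        snr = self.mu ** 2 / torch.exp(self.log_sigma2)
        return snr > threshold  # True = keep this dimension
```

During training, the total loss is the task loss plus β times the sum of `info_penalty()` over all VIB layers (one on the LSTM input, one on the hidden states); after training, `prune_mask()` selects the dimensions to keep when the LSTM matrices are physically shrunk.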
Inference on Raspberry Pi:
To test the inference speed-up of the compressed CNN-VIB-LSTM model obtained with our approach over the naïve CNN-LSTM model for the same task, we deploy both models on a Raspberry Pi 3. We take a single UCF11 video of a person diving from a diving board, labelled ‘Diving’, and infer the action separately with both the uncompressed and the compressed model. The compressed model runs about 100× faster than the uncompressed version, as seen in Figure 3.
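For reference, per-clip latency on such a device can be measured with simple wall-clock timing over repeated forward passes. The helper below is a hypothetical sketch, not part of the paper's code:

```python
import time
import torch

@torch.no_grad()
def mean_latency_s(model, clip, runs=20):
    """Average per-clip inference time in seconds on a CPU-only device."""
    model.eval()
    model(clip)  # warm-up pass so one-time setup costs are excluded
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    return (time.perf_counter() - start) / runs

# clip: a preprocessed video tensor, e.g. of shape (1, num_frames, 3, H, W)
# speedup = mean_latency_s(cnn_lstm, clip) / mean_latency_s(cnn_vib_lstm, clip)
```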


Conclusion and Further work :
We present a generic RNN compression technique based on the VIB theory. Specifically, we propose a compression scheme that extracts the prediction-relevant information from the input features. To this end, we formulate a loss function aimed at compressing LSTMs in end-to-end and CNN-LSTM architectures for human action recognition. Minimizing this loss significantly compresses the baseline's over-parameterized LSTM structural matrices and thereby reduces overfitting. Our approach can thus produce models suitable for deployment on edge devices, which we demonstrate by deploying our CNN-VIB-LSTM model on a Raspberry Pi and running inference on it. Further, we show that our approach can be combined with other compression methods to obtain even more significant compression with only a small drop in accuracy. Future work could combine tensor decomposition and VIB-based compression for all variants of RNNs.
More details can be found in the paper and the YouTube video below.
References
[1] Tishby, Naftali, and Noga Zaslavsky. "Deep learning and the information bottleneck principle." 2015 IEEE Information Theory Workshop (ITW). IEEE, 2015.
[2] Alemi, Alexander A., et al. "Deep variational information bottleneck." arXiv preprint arXiv:1612.00410 (2016).
[3] Yang, Yinchong, Denis Krompass, and Volker Tresp. "Tensor-train recurrent neural networks for video classification." International Conference on Machine Learning. PMLR, 2017.
[4] Pan, Yu, et al. "Compressing recurrent neural networks with tensor ring for action recognition." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33, No. 01. 2019.
[5] Ye, Jinmian, et al. "Learning compact recurrent neural networks with block-term tensor decomposition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.