
Neural Network Pruning 101

All you need to know not to get lost

Whether it is in computer vision, natural language processing or image generation, deep neural networks yield state-of-the-art results. However, their cost in terms of computational power, memory or energy consumption can be prohibitive, putting some of them out of reach of most limited hardware. Yet, many domains would benefit from neural networks, hence the need to reduce their cost while maintaining their performance.

That is the whole point of neural network compression. This field counts multiple families of methods, such as quantization [11], factorization [13], distillation [32] or, the focus of this post, pruning.

Neural network pruning is a method that revolves around the intuitive idea of removing superfluous parts of a network that performs well but costs a lot of resources. Indeed, even though large neural networks have proven countless times how well they could learn, it turns out that not all of their parts are still useful after the training process is over. The idea is to eliminate these parts without impacting the network’s performance.

Unfortunately, the dozens, if not hundreds, of papers published each year reveal the hidden complexity of a supposedly straightforward idea. Indeed, a quick overview of the literature yields countless ways of identifying said useless parts or of removing them before, during or after training; it even turns out that not all kinds of pruning actually allow for accelerating neural networks, which is supposed to be the whole point.

The goal of this post is to provide a solid foundation to tackle the intimidatingly wild literature around neural network pruning. We will review successively three questions that seem to be at the core of the whole domain: "What kind of part should I prune?", "How to tell which parts can be pruned?" and "How to prune parts without harming the network?". To sum it up, we will detail pruning structures, pruning criteria and pruning methods.

1 – Pruning structures

1.1 – Unstructured pruning

When talking about the cost of neural networks, the parameter count is surely one of the most widely used metrics, along with FLOPs (the number of floating-point operations required for inference). It is indeed intimidating to see networks displaying astronomical amounts of weights (up to billions for some), often correlated with stellar performance. Therefore, it is quite intuitive to aim at reducing this count directly by removing parameters themselves. Actually, pruning connections is one of the most widespread paradigms in the literature, enough to be considered as the default framework when dealing with pruning. The seminal work of Han et al. [26] presented this kind of pruning and served as a basis for numerous contributions [18, 21, 25].

Directly pruning parameters has many advantages. First, it is simple, since replacing the value of a weight with zero, within the parameter tensors, is enough to prune a connection. Widespread deep learning frameworks, such as PyTorch, make it easy to access all the parameters of a network, which makes this kind of pruning extremely simple to implement. Still, the greatest advantage of pruning connections remains that they are the smallest, most fundamental elements of networks and, therefore, numerous enough to be pruned in large quantities without impacting performance. Such a fine granularity allows pruning very subtle patterns, up to parameters within convolution kernels, for example. As pruning weights is not limited by any constraint and is the finest way to prune a network, this paradigm is called unstructured pruning.
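To make this concrete, here is a minimal sketch of magnitude-based unstructured pruning in PyTorch (the layer size and pruning rate are arbitrary): pruning a connection boils down to zeroing its weight in the parameter tensor.

```python
import torch
import torch.nn as nn

# Prune 80% of the weights of a single linear layer by zeroing
# those of smallest magnitude.
layer = nn.Linear(256, 128)

with torch.no_grad():
    magnitudes = layer.weight.abs().flatten()
    k = int(0.8 * magnitudes.numel())                 # number of weights to prune
    threshold = magnitudes.kthvalue(k).values         # magnitude below which we prune
    mask = (layer.weight.abs() > threshold).float()   # 1 = keep, 0 = prune
    layer.weight.mul_(mask)                           # zero out the pruned connections
```

Note that this only fills the tensor with zeros; as discussed below, it does not by itself make the layer any cheaper to run.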

However, this method presents a major, fatal drawback: most frameworks and hardware cannot accelerate computation on sparse matrices, meaning that no matter how many zeros you fill the parameter tensors with, it will not impact the actual cost of the network. What does impact it, however, is pruning in a way that directly alters the very architecture of the network, which any framework can handle.

Difference between unstructured (left) and structured (right) pruning: structured pruning removes both convolution filters and rows of kernels instead of just pruning connections. This leads to fewer feature maps within intermediate representations. (image by author)

1.2 – Structured pruning

This is the reason why many works have focused on pruning larger structures, such as whole neurons [36] or, for its direct equivalent within the more modern deep convolutional networks, convolution filters [40, 41, 66]. Filter pruning allows for an exploitable and yet fine enough granularity, as large networks tend to include numerous convolution layers, each counting up to hundreds or thousands of filters. Not only does removing such structures result in sparse layers that can be directly instantiated as thinner ones, but doing so also eliminates the feature maps that are the outputs of such filters.

Therefore, not only are such networks lighter to store, due to fewer parameters, but they also require fewer computations and generate lighter intermediate representations, hence needing less memory during runtime. Actually, it is sometimes more beneficial to reduce bandwidth rather than the parameter count. Indeed, for tasks that involve large images, such as semantic segmentation or object detection, intermediate representations may be prohibitively memory-consuming, far more than the network itself. For these reasons, filter pruning is now seen as the default kind of structured pruning.

Yet, when applying such pruning, one should pay attention to the following aspects. Let’s consider how a convolution layer is built: for Cin input channels and Cout output ones, a convolution layer is made of Cout filters, each counting Cin kernels; each filter outputs one feature map and, within each filter, one kernel is dedicated to each input channel. Considering this architecture, and acknowledging that a regular convolutional network basically stacks convolution layers, pruning a filter, and hence the feature map it outputs, also prunes the corresponding kernel in every filter of the ensuing layer. That means that, when pruning filters, one may actually remove up to twice the amount of parameters thought to be removed in the first place.
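To make this concrete, here is a toy PyTorch sketch (layer sizes and the pruned filter index are arbitrary) showing that removing a filter from one convolution also removes the corresponding kernels from the following one:

```python
import torch
import torch.nn as nn

# Two stacked convolutions: pruning filter `f` of conv1 also removes the
# kernels of conv2 that took its feature map as input.
conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)

f = 5                                                   # filter to prune in conv1
keep = [i for i in range(conv1.out_channels) if i != f]

# Rebuild thinner layers: conv1 loses one filter (dim 0),
# conv2 loses the matching kernels (dim 1).
new_conv1 = nn.Conv2d(3, len(keep), kernel_size=3)
new_conv2 = nn.Conv2d(len(keep), 32, kernel_size=3)

with torch.no_grad():
    new_conv1.weight.copy_(conv1.weight[keep])      # [15, 3, 3, 3]
    new_conv1.bias.copy_(conv1.bias[keep])
    new_conv2.weight.copy_(conv2.weight[:, keep])   # [32, 15, 3, 3]
    new_conv2.bias.copy_(conv2.bias)
```

Here, pruning a single 3 × 3 × 3 filter of conv1 also removes 32 kernels of size 3 × 3 from conv2, hence the extra pruned parameters mentioned above.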

Let’s also consider that, when a whole layer happens to get pruned (which tends to happen because of layer collapse [62], but does not always break the network, depending on the architecture), the previous layer’s outputs become totally unconnected and are therefore pruned too: pruning a whole layer may actually prune all the previous layers whose outputs are not somehow connected elsewhere (because of residual connections [28] or whole parallel paths [61]). Therefore, when pruning filters, one should compute the exact number of actually pruned parameters. Indeed, pruning the same number of filters, depending on their distribution within the architecture, may not lead to the same actual number of pruned parameters, making results impossible to compare otherwise.

Before changing topic, let’s just mention that, albeit a minority, some works focus on pruning convolution kernels, intra-kernel structures [2, 24, 46] or even specific parameter-wise structures. However, such structures need special implementations to lead to any kind of speedup (as is the case for unstructured pruning). Another kind of exploitable structure, though, is to turn convolutions into "shift layers" by pruning all but one parameter in each kernel, which can then be expressed as a combination of a shifting operation and a 1 × 1 convolution [24].

The danger of structured pruning: altering the input and output dimensions of layers can lead to discrepancies. While, on the left, both layers output the same number of feature maps, which can then be summed together, their pruned counterparts on the right produce intermediate representations of different dimensions that cannot be summed without further processing. (image by author)

2 – Pruning criteria

Once one has decided what kind of structure to prune, the next question one may ask could be: "Now, how do I figure out which ones to keep and which ones to prune?". To answer it, one needs a proper pruning criterion that will rank the relative importance of parameters, filters or other structures.

2.1 – Weight magnitude criterion

One criterion that is quite intuitive and surprisingly efficient is pruning weights whose absolute value (or "magnitude") is the smallest. Indeed, under the constraint of a weight-decay, those which do not contribute significantly to the function are expected to have their magnitude shrink during training. Therefore, the superfluous weights are expected to be those of lesser magnitude [8]. Notwithstanding its simplicity, the magnitude criterion is still widely used in modern works [21, 26, 58], making it a staple of the domain.

However, although this criterion seems trivial to implement in the case of unstructured pruning, one may wonder how to adapt it to structured pruning. One straightforward way is to order filters according to their norm (L1 or L2, for example) [40, 70]. While this method is quite straightforward, one may wish to encapsulate multiple sets of parameters within one measure: for example, a convolutional filter, its bias and its batch-normalization parameters together, or even corresponding filters within parallel layers whose outputs are then fused and whose channels one would like to reduce.
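For instance, ranking filters by their L1 norm, in the spirit of [40], can be sketched as follows (layer sizes and pruning rate are arbitrary):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3)

# L1 norm of each filter: sum of absolute values over its Cin x k x k weights.
l1_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # shape: [128]
ranking = torch.argsort(l1_norms)                           # smallest norm first

prune_rate = 0.5
filters_to_prune = ranking[: int(prune_rate * conv.out_channels)]
```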

One way to do that, without having to compute the combined norm of these parameters, involves inserting a learnable multiplicative parameter, or gate, for each feature map after each set of layers to prune. This gate, when reduced to zero, effectively prunes the whole set of parameters responsible for this channel, and its magnitude accounts for the importance of all of them. The method hence consists in pruning the gates of lesser magnitude [36, 41].
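A minimal sketch of such gates (the ChannelGate module below is purely illustrative) could look like this:

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """One learnable scalar per feature map, multiplied onto the output."""
    def __init__(self, num_channels):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(num_channels))

    def forward(self, x):                       # x: [N, C, H, W]
        return x * self.gate.view(1, -1, 1, 1)

block = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    ChannelGate(16),
    nn.ReLU(),
)

# After training (ideally with a sparsity penalty on the gates), the channels
# whose gates have the smallest magnitude are the ones to prune, together with
# their filter, bias and batch-norm parameters.
channels_to_prune = torch.argsort(block[2].gate.detach().abs())[:4]
```

Note that [41] uses the batch-normalization scaling factors themselves as such gates, which avoids adding any extra parameter.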

2.2 – Gradient magnitude pruning

Magnitude of the weight is not the only popular criterion (or family of criteria) that exists. Actually, the other main criterion to have lasted up to now is the magnitude of the gradient. Indeed, back in the 80’s some fundamental works [37, 53] theorized, through a Taylor decomposition of the impact of removing a parameter on the loss, that some metrics, derived from the back-propagated gradient, may provide a good way to determine which parameters could be pruned without damaging the network.

More modern implementations of this criterion [4, 50] actually accumulate gradients over a minibatch of training data and prune on the basis of the product between this gradient and the corresponding weight of each parameter. This criterion can be applied to the aforementioned gates too [49].
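A rough rendition of this kind of criterion, on a toy model and a random minibatch, could look like the following, where the importance of each parameter is the absolute value of its weight multiplied by its gradient:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()

# One minibatch of (random) data is enough to populate the gradients.
inputs, targets = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss_fn(model(inputs), targets).backward()

# Element-wise importance: |w * dL/dw|, which can then be summed per filter
# or per neuron for structured pruning.
importances = {
    name: (param * param.grad).abs()
    for name, param in model.named_parameters()
    if param.grad is not None
}
```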

2.3 – Global or local pruning

One final aspect to take into consideration is whether the chosen criterion is applied globally to all parameters or filters of the network, or whether it is computed independently for each layer. While global pruning has proven many times to yield better results, it can lead to layer collapse [62]. A simple way to avoid this problem, when the method used cannot prevent layer collapse, is to resort to layer-wise local pruning, namely pruning at the same rate in each layer.
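The difference can be illustrated on two toy weight tensors (assumed already flattened for simplicity): with the same overall pruning rate, local pruning removes the same proportion from each layer, while global pruning may prune one layer much more heavily than the other.

```python
import torch

w1 = torch.randn(1000)          # layer with "large" weights
w2 = torch.randn(1000) * 0.1    # layer with much smaller weights
rate = 0.5

# Local pruning: each layer gets its own threshold and loses exactly 50%.
t1 = w1.abs().kthvalue(int(rate * w1.numel())).values
t2 = w2.abs().kthvalue(int(rate * w2.numel())).values

# Global pruning: one threshold for all weights; the second layer will lose
# far more than 50% of its weights and may collapse entirely.
all_w = torch.cat([w1, w2]).abs()
t_global = all_w.kthvalue(int(rate * all_w.numel())).values
```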

Difference between local pruning (left) and global pruning (right): local pruning applies the same rate to each layer while global applies it on the whole network at once. (image by author)

3 – Pruning method

Now that we have our pruning structure and criterion, the last thing to decide is which method to use to prune the network. This is actually the topic on which the literature can be the most confusing, as each paper brings its own quirks and gimmicks, so much so that one may get lost between what is methodologically relevant and what is just a specificity of a given paper.

This is why we will thematically overview some of the most popular families of methods for pruning neural networks, in an order that highlights the evolution of the use of sparsity during training.

3.1 – The classic framework: train, prune and fine-tune

The first basic framework to know is the train, prune and fine-tune method, which involves: 1) training the network, 2) pruning it by setting to zero all parameters targeted by the pruning structure and criterion (these parameters cannot recover afterward) and 3) training the network for a few extra epochs at a low learning rate, to give it a chance to recover from the loss in performance induced by pruning. Usually, the last two steps are iterated, each time with a growing pruning rate.

Han et al. [26] apply this framework, with 5 iterations of pruning and fine-tuning, to weight magnitude pruning. Iterating has been shown to improve performance, at the cost of extra computation and training time. This simple framework serves as a basis for many works [26, 40, 41, 50, 66] and can be seen as the default method on which all the others have built.
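As an illustration, here is a self-contained toy sketch of this classic framework (the model, data and hyperparameters are arbitrary placeholders, not those of [26]):

```python
import torch
import torch.nn as nn

def magnitude_prune(model, rate):
    """Zero out the `rate` fraction of weights of smallest magnitude, globally."""
    all_w = torch.cat([p.detach().abs().flatten()
                       for n, p in model.named_parameters() if "weight" in n])
    threshold = all_w.kthvalue(int(rate * all_w.numel())).values
    masks = {}
    with torch.no_grad():
        for n, p in model.named_parameters():
            if "weight" in n:
                masks[n] = (p.abs() > threshold).float()
                p.mul_(masks[n])
    return masks

def train_epochs(model, data, targets, epochs, lr, masks=None):
    """Toy training loop; if masks are given, pruned weights stay at zero."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()
        if masks is not None:
            with torch.no_grad():
                for n, p in model.named_parameters():
                    if n in masks:
                        p.mul_(masks[n])

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
data, targets = torch.randn(128, 20), torch.randint(0, 2, (128,))

train_epochs(model, data, targets, epochs=30, lr=0.1)      # 1) training
for i in range(1, 6):                                      # 5 iterations of...
    masks = magnitude_prune(model, rate=0.9 * i / 5)       # 2) pruning
    train_epochs(model, data, targets, 10, 0.01, masks)    # 3) fine-tuning
```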

3.2 – Extending the classic framework

While not straying too far, some methods have brought significant modifications to the aforementioned classic framework of Han et al. [26]. Gale et al. [21] push the principle of iterations further by progressively removing an increasing amount of weights all along the training process, which allows benefiting from the advantages of iterations while removing the whole fine-tuning step. He et al. [29] set prunable filters to zero at each epoch, while still letting them learn and be updated afterward, in order to let their weights grow back after pruning while enforcing sparsity during training.

Finally, the method of Renda et al. [58] involves fully retraining a network once it is pruned. Unlike fine-tuning, which is performed at the lowest learning rate, retraining follows the same learning-rate schedule as the original training, hence its name: "Learning Rate Rewinding". This retraining has been shown to yield better performance than mere fine-tuning, at a significantly higher cost.

3.3 – Pruning at initialization

In order to speed up training, avoid fine-tuning and prevent any alteration of the architecture during or after training, multiple works have focused on pruning before training. In the wake of SNIP [39], many works have studied the use of the work of LeCun et al. [37] or of Mozer and Smolensky [53] to prune at initialization [12, 64], including intensive theoretical studies [27, 38, 62]. However, Optimal Brain Damage [37] relies on multiple approximations, including an "extremal" approximation that "assumes that parameter deletion will be performed after training has converged" [37]; this fact is rarely mentioned, even by works that build on it. Some works have also raised reservations about the ability of such methods to generate masks whose relevance outshines random masks with a similar per-layer distribution [20].

Another family of methods that study the relationship between pruning and initialization gravitates around the "Lottery Ticket Hypothesis" [18]. This hypothesis states that a "randomly-initialized, dense neural network contains a subnetwork that is initialized such that – when trained in isolation – it can match the test accuracy of the original network after training for at most the same number of iterations". In practice, this literature studies how well a pruning mask, defined using an already converged network, can be applied to the network back when it was just initialized. Multiple works have expanded, stabilized or studied this hypothesis [14, 19, 45, 51, 69]. However, once again, multiple works question the validity of the hypothesis and of the method used to study it [21, 42], and some even tend to show that its benefits come rather from fully training with the definitive mask than from any hypothetical "winning ticket" [58].

Comparison between the classic "train, prune and fine-tune" framework [26], the lottery ticket experiment [18] and learning rate rewinding [58]. (image by author)

3.4 – Sparse training

The previous methods are linked by a seemingly shared underlying theme: training under sparsity constraints. This principle is at the core of a family of methods, called sparse training, which consists in enforcing a constant rate of sparsity during training while its distribution varies and is progressively adjusted. Introduced by Mocanu et al. [47], it involves: 1) initializing the network with a random mask that prunes a certain proportion of the network, 2) training this pruned network for one epoch, 3) pruning a certain amount of the weights of lowest magnitude and 4) regrowing the same amount of random weights.

That way, the pruning mask, at first random, is progressively adjusted to target the least important weights while enforcing sparsity throughout training. The sparsity level can be the same for each layer [47] or global [52]. Other methods have extended sparse training by using a certain criterion to regrow weights instead of choosing them randomly [15, 17].
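One prune-and-regrow step of such a method could be sketched as follows (the layer, sparsity level and update fraction are arbitrary; this is only a rough illustration of the principle, not the exact algorithm of [47]):

```python
import torch
import torch.nn as nn

def prune_and_regrow(weight, mask, update_fraction=0.3):
    """Drop the lowest-magnitude active weights and regrow as many random ones."""
    with torch.no_grad():
        active = mask.bool()
        n_update = int(update_fraction * active.sum().item())

        # Candidates for regrowth: positions that are currently pruned.
        zero_idx = (~active).view(-1).nonzero().flatten()

        # 1) prune: among active weights, drop those of smallest magnitude.
        magnitudes = weight.abs().masked_fill(~active, float("inf"))
        drop = torch.topk(magnitudes.view(-1), n_update, largest=False).indices
        mask.view(-1)[drop] = 0.0

        # 2) regrow: reactivate as many randomly chosen pruned positions;
        # their weights restart from zero and get trained afterward.
        grow = zero_idx[torch.randperm(zero_idx.numel())[:n_update]]
        mask.view(-1)[grow] = 1.0

        weight.mul_(mask)   # keep the weight tensor consistent with the mask
    return mask

layer = nn.Linear(128, 128)
mask = (torch.rand_like(layer.weight) > 0.8).float()   # ~80% initial sparsity
with torch.no_grad():
    layer.weight.mul_(mask)
# ... train for one epoch here, keeping pruned weights at zero ...
mask = prune_and_regrow(layer.weight, mask)
```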

Sparse training cuts and grows different weights periodically during training, which leads to an adjusted mask that should target only relevant parameters. (image by author)

3.5 – Mask learning

Instead of relying on arbitrary criteria to prune or regrow weights, multiple methods focus on learning a pruning mask during training. Two types of methods seem to prevail in this domain: 1) mask learning through separate networks or layers and 2) mask learning through auxiliary parameters. Multiple kinds of strategies fit into the first type: training separate agents to prune as many filters of a layer as possible while maximizing accuracy [33], inserting attention-based layers [68] or using reinforcement learning [30]. The second type treats pruning as an optimization problem that aims at minimizing both the L0 norm of the network and its supervised loss.

Since the L0 norm is non-differentiable, the various methods mainly circumvent this problem through the use of penalized auxiliary parameters that are multiplied with their corresponding parameter during the forward pass [59, 23]. Many methods [44, 60, 67] rely on an approach analogous to that of "Binary Connect" [11], namely: applying stochastic gates over parameters, each drawn from its own Bernoulli distribution whose parameter p is learned using a "Straight-Through Estimator" [3] or other means [44].
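A minimal sketch of such a stochastic gate trained with a straight-through estimator (the module and its name are purely illustrative) could look like this:

```python
import torch
import torch.nn as nn

class StochasticGate(nn.Module):
    """Per-feature binary gate: hard 0/1 samples in the forward pass,
    with gradients flowing through the underlying probabilities."""
    def __init__(self, num_gates):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_gates))

    def forward(self, x):                         # x: [batch, num_gates]
        p = torch.sigmoid(self.logits)            # learned Bernoulli parameters
        hard = torch.bernoulli(p).detach()        # non-differentiable 0/1 sample
        gate = hard + p - p.detach()              # forward = hard, backward = d/dp
        return x * gate

gate = StochasticGate(num_gates=64)
y = gate(torch.randn(8, 64))

# The expected L0 norm of the mask is the sum of the gate probabilities,
# which can be added (weighted) to the supervised loss as a sparsity penalty.
l0_penalty = torch.sigmoid(gate.logits).sum()
```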

3.6 – Penalty-based methods

Many methods, instead of pruning connections manually or penalizing auxiliary parameters, rather apply various kinds of penalties to the weights themselves to make them progressively shrink toward zero. This notion is actually pretty ancient [57], as weight decay is already an essential element of the weight magnitude criterion. Beyond using mere weight decay, even back then, multiple works focused on elaborating penalties specifically designed to enforce sparsity [55, 65]. Today, various methods apply different regularizations, on top of weight decay, to further increase sparsity (typically, using the L1 norm [41]).
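As a sketch of this idea, in the spirit of network slimming [41] (the model and penalty strength below are arbitrary), an L1 term on the batch-normalization scaling factors can simply be added to the training loss:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
)

def l1_on_bn_scales(model, strength=1e-4):
    """Sparsity-inducing penalty on the batch-norm scaling factors."""
    penalty = sum(m.weight.abs().sum()
                  for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return strength * penalty

# During training: loss = task_loss + l1_on_bn_scales(model)
```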

Among modern works, multiple methods rely on the LASSO (Least Absolute Shrinkage and Selection Operator) [22, 31, 66] to prune weights or groups. Other methods develop penalties that target weak connections to increase the gap between the parameters to keep and those to prune, so that their removal has less impact [7, 16]. Some methods show that targeting a subset of weights with a penalization that grows all throughout training can progressively prune them and make their removal seamless [6, 9, 63]. The literature also counts a whole range of methods built around the principle of "Variational Dropout" [34], a method based on variational inference [5] applied to deep learning [35]. As a pruning method [48], it birthed multiple works that adapt its principle to structured pruning [43, 54].

4 – Available frameworks

While most of these methods have to be implemented from scratch (or reused from the sources provided by each paper, when available), some frameworks exist to apply basic methods or to make such implementations easier.

4.1 – PyTorch

PyTorch [56] provides multiple quality-of-life features to help prune networks. The provided tools make it easy to apply a mask to a network and to maintain it during training, as well as to revert it if needed. PyTorch also provides some basic pruning methods, such as global or local pruning, whether structured or not. Structured pruning can be applied along any dimension of the weight tensors, which allows pruning filters, rows of kernels or even rows and columns inside kernels. These built-in methods also allow pruning randomly or according to various norms.
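For instance, the built-in torch.nn.utils.prune module can be used as follows (layer sizes and pruning amounts are arbitrary):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))

# Local unstructured pruning: remove 30% of the first convolution's weights,
# selected by smallest L1 magnitude (a mask is registered on the module).
prune.l1_unstructured(model[0], name="weight", amount=0.3)

# Local structured pruning: remove half of the filters (dim 0) of the second
# convolution, ranked by their L2 norm.
prune.ln_structured(model[2], name="weight", amount=0.5, n=2, dim=0)

# Global unstructured pruning over several layers at once.
prune.global_unstructured(
    [(model[0], "weight"), (model[2], "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Make the masks permanent (removes the mask/reparameterization).
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")
```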

4.2 – TensorFlow

The Keras [10] library from TensorFlow [1] provides some basic tools to prune weights of lower magnitude. As in the work of Han et al. [25], the efficiency of pruning is measured in terms of how much the redundancy introduced by all the inserted zeros allows the model to be compressed (which combines well with quantization).
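As an illustration, assuming the TensorFlow Model Optimization Toolkit is installed (imported below as tfmot), magnitude pruning of a Keras model may look roughly like this:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Sparsity is gradually increased from 0% to 80% over the training steps.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=schedule)

pruned_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
# pruned_model.fit(x, y, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export; the inserted zeros then compress
# well (e.g. with standard compression), which is how the gain is measured.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```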

4.3 – ShrinkBench

Blalock et al. [4] provide, alongside their work, a custom library in an effort to help the community normalize how pruning algorithms are compared. Based on PyTorch, ShrinkBench aims at making the implementation of pruning methods easier while normalizing the conditions under which they are trained and tested. It provides several baselines, such as random pruning, global or layerwise pruning, and weight-magnitude or gradient-magnitude pruning.

5 – Brief recap of reviewed methods

In this article, many papers have been cited. Here is a simple table to roughly summarize what they do and what differentiates them (provided dates are those of first publication):

6 – Conclusion

In our quick overview of the literature, we saw that 1) pruning structures define which kind of gain to expect from pruning, 2) pruning criteria are based on various theoretical or practical justifications and 3) pruning methods tend to revolve around introducing sparsity during training to reconcile performance and cost. We also saw that, even though its founding works date back to the late 80s, neural network pruning is a very dynamic field that still experiences fundamental discoveries and new basic concepts today.

Despite the daily contributions in the domain, there seems to be plenty of room left for exploration and innovation. If each subfamily of methods can be seen as an attempt to answer a question ("How to regrow pruned weights?", "How to learn pruning masks through optimization?", "How to relax weight removal by softer means?"…), then the evolution of the literature seems to point in a certain direction: that of sparsity throughout training. This direction itself raises many questions, such as: "Do pruning criteria work well on networks that haven’t converged yet?" or "How to tell the benefit of the choice of the weights to prune from that of training with any kind of sparsity from the start?"

References

[1] Martı́n Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):1–18, 2017.

[3] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[4] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033, 2020.

[5] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.

[6] Miguel A Carreira-Perpinán and Yerlan Idelbayev. "learning-compression" algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8532–8541, 2018.

[7] Jing Chang and Jin Sha. Prune deep neural networks with the modified L1/2 penalty. IEEE Access, 7:2273–2280, 2018.

[8] Yves Chauvin. A back-propagation algorithm with optimal use of hidden units. In NIPS, volume 1, pages 519–526, 1988.

[9] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Compression of deep convolutional neural networks under joint sparsity constraints. arXiv preprint arXiv:1805.08303, 2018.

[10] Francois Chollet et al. Keras, 2015.

[11] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.

[12] Pau de Jorge, Amartya Sanyal, Harkirat S Behl, Philip HS Torr, Gregory Rogez, and Puneet K Dokania. Progressive skeletonization: Trimming more fat from a network at initialization. arXiv preprint arXiv:2006.09081, 2020.

[13] Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In 28th Annual Conference on Neural Information Processing Systems 2014, NIPS 2014, pages 1269–1277. Neural information processing systems foundation, 2014.

[14] Shrey Desai, Hongyuan Zhan, and Ahmed Aly. Evaluating lottery tickets under distributional shifts. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 153–162, 2019.

[15] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

[16] Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. Global sparse momentum sgd for pruning very deep neural networks. arXiv preprint arXiv:1909.12778, 2019.

[17] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pages 2943–2952. PMLR, 2020.

[18] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.

[19] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.

[20] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? arXiv preprint arXiv:2009.08576, 2020.

[21] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.

[22] Susan Gao, Xin Liu, Lung-Sheng Chien, William Zhang, and Jose M Alvarez. Vacl: Variance-aware cross-layer regularization for pruning deep residual networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.

[23] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In NIPS, 2016.

[24] Ghouthi Boukli Hacene, Carlos Lassance, Vincent Gripon, Matthieu Courbariaux, and Yoshua Bengio. Attention based pruning for shift networks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4054–4061. IEEE, 2021.

[25] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[26] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.

[27] Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. Robust pruning at initialization.

[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[29] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018.

[30] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.

[31] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.

[32] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.

[33] Qiangui Huang, Kevin Zhou, Suya You, and Ulrich Neumann. Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 709–718. IEEE, 2018.

[34] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. stat, 1050:8, 2015.

[35] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. stat, 1050:1, 2014.

[36] John K Kruschke and Javier R Movellan. Benefits of gain: Speeded learning and minimal hidden layers in back-propagation networks. IEEE Transactions on systems, Man, and Cybernetics, 21(1):273–280, 1991.

[37] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.

[38] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip HS Torr. A signal propagation perspective for pruning neural networks at initialization. In International Conference on Learning Representations, 2019.

[39] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. International Conference on Learning Representations, ICLR, 2019.

[40] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[41] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.

[42] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2018.

[43] C Louizos, K Ullrich, and M Welling. Bayesian compression for deep learning. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA., 2017.

[44] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l 0 regularization. arXiv preprint arXiv:1712.01312, 2017.

[45] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.

[46] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.

[47] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):1–12, 2018.

[48] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507. PMLR, 2017.

[49] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.

[50] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.

[51] Ari S Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. stat, 1050:6, 2019.

[52] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pages 4646–4655. PMLR, 2019.

[53] Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in neural information processing systems, pages 107–115, 1989.

[54] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Structured bayesian pruning via log-normal multiplicative noise. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6778–6787, 2017.

[55] Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.

[56] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.

[57] Russell Reed. Pruning algorithms-a survey. IEEE transactions on Neural Networks, 4(5):740–747, 1993.

[58] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.

[59] Pedro Savarese, Hugo Silva, and Michael Maire. Winning the lottery with continuous sparsification. Advances in Neural Information Processing Systems, 33, 2020.

[60] Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 138–145, 2017.

[61] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[62] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33, 2020.

[63] Hugo Tessier, Vincent Gripon, Mathieu Léonardon, Matthieu Arzel, Thomas Hannagan, and David Bertrand. Rethinking weight decay for efficient neural network pruning. 2021.

[64] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2019.

[65] Andreas S Weigend, David E Rumelhart, and Bernardo A Huberman. Generalization by weight-elimination with application to forecasting. In Advances in neural information processing systems, pages 875–882, 1991.

[66] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.

[67] Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. Autoprune: Automatic network pruning by regularizing auxiliary parameters. Advances in neural information processing systems, 32, 2019.

[68] Kohei Yamamoto and Kurato Maeno. Pcas: Pruning channels with attention statistics for deep network compression. arXiv preprint arXiv:1806.05382, 2018.

[69] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067, 2019.

[70] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. In NeurIPS, 2018.

