NeurIPS 2020 Papers: Takeaways for a Deep Learning Engineer

Techniques and insights for applied deep learning from papers published at NeurIPS 2020

Prakash Kagitha
Towards Data Science


Image by Author

Advances in Deep Learning research are of great utility to a Deep Learning engineer working on real-world problems, because most Deep Learning research is empirical and new techniques and theories are validated on datasets and tasks that closely resemble real-world ones (ImageNet pre-trained weights are still useful!).

But churning through a vast amount of research to pick out the techniques, insights, and perspectives relevant to a DL engineer is time-consuming, stressful, and, not least, overwhelming.

For whatever reason, I am crazy (I mean, really crazy! See Exhibit A here and here) about Deep Learning research, and I also have to justify a Deep Learning engineer’s role to earn my living. That puts me in a great place to do exactly this kind of DL-engineer-relevant research churning.

Therefore, I went through all the titles of the NeurIPS 2020 papers (more than 1,900!), read the abstracts of 175 of them, and extracted DL-engineer-relevant insights from the following papers.

Now, sit back and enjoy.

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

2.5x faster pre-training with Switchable Transformers (ST) compared to standard Transformers.

Equipped with Switchable Gates (G in the figure below), some of the layers are skipped at random according to a 0 or 1 sampled from a Bernoulli distribution, which makes training about 25% more time-efficient per sample.

(a) Standard Transformer (b) Reordering to make it PreLN (c) Switchable Gates (G) to decide whether to include a layer or not. (The image is reproduced from the pdf of the current paper.)
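
To make the gating mechanism concrete, here is a minimal PyTorch sketch of a switchable layer (my own illustration, not the paper's implementation; the class name and the fixed keep probability are assumptions, and the paper's progressive schedule of the drop rate is omitted):

```python
import torch
import torch.nn as nn

class SwitchableEncoderLayer(nn.Module):
    """Hedged sketch of a switchable Transformer layer, not the paper's code.
    During training the whole layer is skipped with probability (1 - keep_prob);
    the paper additionally schedules this probability over training (omitted here)."""

    def __init__(self, d_model: int = 768, nhead: int = 12, keep_prob: float = 0.75):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.keep_prob = keep_prob  # expected fraction of steps in which the layer runs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) > self.keep_prob:
            return x              # gate G = 0: skip the layer, only the residual path remains
        return self.layer(x)      # gate G = 1: run the layer as usual
```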

And remarkably, it is shown to reach the same validation error as the baselines with 53% fewer training samples.

Combining both time and sample efficiency, pre-training is 2.5x faster with comparable and sometimes better performance on downstream tasks.

Takeaway: When you want to pretrain or finetune a transformer, try out Switchable Transformers for faster training along with low inference times.

Coresets for Robust Training of Neural Networks against Noisy Labels

It has been shown before that, after some training, the Jacobian of a neural network with respect to its weights (W), evaluated on clean data (X), is approximately low-rank, with a few large singular values and many very small ones.

Also, learning that generalizes (i.e., learning from clean data) happens in a low-dimensional subspace called the Information space (I), while learning that doesn't generalize (i.e., learning from noisy labels, mostly memorization) happens in a high-dimensional subspace called the Nuisance space (N).

The current work introduces a technique that builds subsets of mostly clean data (Coresets) to train the model on, and shows a significant increase in performance on noisy datasets, e.g., a 7% improvement on mini WebVision with 50% noisy labels compared to the state of the art.

The method introduced in this work, CRUST, performs significantly better than the state-of-the-art. (The image is reproduced from the pdf of the current paper.)
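
As a rough illustration of the idea (a hedged sketch of my own, not the authors' CRUST algorithm), one could cluster a simple per-example gradient proxy within each class and keep the examples closest to the cluster centres:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_coreset(grad_proxy: np.ndarray, labels: np.ndarray, frac: float = 0.5) -> np.ndarray:
    """Hedged sketch, not the paper's CRUST: per class, cluster a per-example
    gradient proxy (e.g., softmax(logits) - one_hot(label)) and keep the example
    nearest to each centre, assuming clean examples form dense clusters."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = max(1, int(frac * len(idx)))
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10).fit(grad_proxy[idx])
        for centre in km.cluster_centers_:
            dists = np.linalg.norm(grad_proxy[idx] - centre, axis=1)
            keep.append(idx[np.argmin(dists)])  # nearest (mostly clean) example to this centre
    return np.unique(np.array(keep))            # indices of the selected, mostly clean subset
```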

Takeaway: When you suspect the dataset you collected has noisy/mislabeled data points, use CRUST to train the model only on the clean data and improve performance and robustness.

The Lottery Ticket Hypothesis for Pre-trained BERT Networks

The lottery ticket hypothesis states that there exists a sub-network that, trained with the same procedure, exhibits performance comparable to the original complete network. These sub-networks are called lottery tickets and are defined by masks that tell which weights are zeroed out in the original network.

The current work adopts Iterative Magnitude Pruning (IMP), which trains a subnetwork for some time and prunes the k% of weights with the smallest magnitude. This process is repeated until the target sparsity is reached. Importantly, after every iteration of training the model starts again from the initial parameters rather than from the weights updated so far, which is called rewinding.

Here, the pre-trained weights of BERT are the initialization that IMP starts from. The lottery ticket, being a subnetwork of the pre-trained BERT, therefore contains the same pre-trained weights with some of them zeroed out.
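
Here is a hedged PyTorch sketch of IMP with rewinding (illustrative only; `train_fn` and the per-round pruning fraction are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_magnitude_pruning(model: nn.Module, train_fn, rounds: int = 10,
                                prune_frac: float = 0.2) -> nn.Module:
    """Hedged sketch of IMP with rewinding, not the paper's code. `train_fn(model)`
    is assumed to train the (masked) model in place for a fixed budget."""
    layers = [m for m in model.modules() if isinstance(m, nn.Linear)]
    init_weights = [m.weight.detach().clone() for m in layers]  # rewind point, e.g. pre-trained BERT
    params = [(m, "weight") for m in layers]
    for _ in range(rounds):
        train_fn(model)                                   # train the current subnetwork
        prune.global_unstructured(params,                 # zero out the smallest-magnitude weights
                                  pruning_method=prune.L1Unstructured,
                                  amount=prune_frac)
        with torch.no_grad():                             # rewind: reset surviving weights to their
            for m, w0 in zip(layers, init_weights):       # initial values, keeping the masks
                m.weight_orig.copy_(w0)
    return model  # the accumulated masks define the lottery ticket
```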

This work shows that the lottery ticket hypothesis holds for pre-trained BERT models as well, and finds subnetworks at 40% to 90% sparsity for a range of downstream tasks.

The last row corresponds to the approach introduced in this paper. Even though it is 40%-90% sparse, performance is comparable to a Full BERT base. (The image is reproduced from the pdf of the current paper.)

Also, the authors found a pre-trained BERT ticket with 70% sparsity which transfers to many downstream tasks and performs at least as well as, or better than, a 70% sparse ticket found specifically for each downstream task.

The second-to-last row, (IMP) MLM (70%), shows that there is a general 70% sparse BERT that generalizes to all the downstream tasks while being at least as good as the 70% sparse ticket of each particular task. (The image is reproduced from the pdf of the current paper.)

Takeaway: A Deep Learning engineer working on NLP has to fine-tune a pre-trained BERT on a downstream task very often. Instead of starting from a full-size BERT, start fine-tuning from the 70% sparse lottery ticket found on the MLM task (second-to-last row) to train faster and to cut inference time and memory bandwidth without losing performance. It’s a no-brainer!

MPNet: Masked and Permuted Pre-training for Language Understanding

MPNet is a hybrid of Masked Language Modeling (MLM) and auto-regressive Permuted Language Modeling (PLM), adopting the strengths of each of its constituents while avoiding their limitations.

Masked language modeling, as in BERT-style models, masks out ~15% of the tokens and tries to predict those masked tokens. Since the dependency between the masked tokens is not modeled, the predictions cannot condition on one another, a limitation termed Output Dependency.

On the other side, auto-regressive permuted language modeling, as in XLNet, doesn’t see full information about the input sentence: when predicting, say, the 5th element of an 8-element sequence, the model doesn’t know that there are 8 elements in the sequence. This leads to a pretrain-finetune discrepancy (the model sees the entire input sentence/paragraph in downstream tasks), a limitation framed as a lack of Input Consistency.

MPNet combines both of them. The XLNet-like architecture is modified by appending additional mask tokens up to the end of the sentence, so that the prediction at any position attends to N tokens, where N is the length of the sequence, with some of them being masks.
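
To make that layout concrete, here is a small, hedged illustration of my own (not the authors' code, and it ignores the two-stream attention machinery discussed below) of how an MPNet-style input could be constructed:

```python
import random

def mpnet_style_inputs(tokens, mask_token="[MASK]", pred_frac=0.15):
    """Hedged illustration, not the authors' code: permute the sequence, keep the
    first part as visible context, and append mask tokens that carry the positions
    of the tokens to be predicted, so every prediction step sees N positions."""
    n = len(tokens)
    perm = random.sample(range(n), n)                  # a random permutation of the positions
    c = n - max(1, int(pred_frac * n))                 # the last tokens in the permutation are predicted
    visible, predicted = perm[:c], perm[c:]
    input_tokens = [tokens[i] for i in visible] + [mask_token] * len(predicted)
    input_positions = visible + predicted              # the masks keep their original position ids
    targets = [tokens[i] for i in predicted]
    return input_tokens, input_positions, targets
```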

Illustrative example showing how MPNet combines MLM and PLM. (The image is reproduced from the pdf of the current paper.)

They use two-stream self-attention, introduced in XLNet, to enable auto-regressive-style prediction in a single pass: at any position, the content must be masked for the prediction at that step but visible for the predictions at later steps.

“MPNet outperforms MLM and PLM by a large margin and achieves better results on tasks including GLUE, SQuAD compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa).”

Takeaway: If you ever want to pretrain a language model on your domain-specific data or with more data than the state-of-the-art, use MPNet, which is shown to have the best of both the MLM and PLM worlds.

Identifying Mislabeled Data using the Area Under the Margin Ranking

Mislabeled data is common in large-scale datasets, as they are crowdsourced or scraped from the internet, both of which are noise-prone.

This work formulates a simple, intuitive idea. Let's say there are 100 dog images, but 20 of them are labeled as ‘bird’. Similarly, there are 100 bird images, but 20 of them are labeled as ‘dog’.

After some training, for an image of a dog wrongly labeled as ‘bird’, the model gives a considerable probability to the label ‘dog’ because of generalization from the 80 correctly labeled dog images. It also gives a considerable probability to the label ‘bird’ because of memorizing the 20 wrongly labeled images.

Now, the margin of a sample, the difference between the probability (or logit) of its assigned label (‘bird’) and the largest probability among the other labels (‘dog’ here), averaged over training, is called the Area Under the Margin (AUM). This work recommends that if a sample's AUM is below some pre-defined threshold, we should treat it as a wrongly labeled data sample and remove it from training.

If we can’t settle on one threshold value, we can intentionally mislabel some extra data points and look at their AUM values; that gives us our threshold.
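
Tracking this statistic during training is straightforward to sketch; the following is a hedged per-batch update of my own (variable names and bookkeeping are assumptions, not the paper's code):

```python
import torch

def update_aum(aum_sum: torch.Tensor, counts: torch.Tensor, logits: torch.Tensor,
               labels: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Hedged sketch, not the paper's code: accumulate margin = assigned-label logit
    minus the largest other logit for every sample in the batch; the running average
    over training is the AUM. `aum_sum` and `counts` are per-sample tensors indexed
    by dataset position `idx`."""
    assigned = logits.gather(1, labels.unsqueeze(1)).squeeze(1)      # logit of the given label
    masked = logits.scatter(1, labels.unsqueeze(1), float("-inf"))   # hide the assigned label
    margin = assigned - masked.max(dim=1).values                     # assigned minus best other
    aum_sum[idx] += margin.detach()
    counts[idx] += 1
    return aum_sum / counts.clamp(min=1)   # current AUM per sample; low values suggest mislabeling
```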

“On the WebVision50 classification task, this method removes 17% of training data, yielding a 1.6% (absolute) drop in test error. On CIFAR100 removing 13% of the data leads to a 1.2% drop in error.”

Takeaway: When creating a dataset, noisy/mislabeled data samples are mostly unavoidable. Then, use the AUM method to find the mislabeled data samples and remove them from the final training dataset.

Rethinking the Value of Labels for Improving Class-Imbalanced Learning

Do we need labels when existing labels are class imbalanced (some classes have more labeled examples than others) and we have a lot of unlabeled data?

Positive. Yes, we need labels. Self-train on the unlabeled data and you will be golden. (Self-training is a process where an intermediate model, trained on the human-labeled data, is used to create ‘labels’, thus pseudo-labels, and the final model is then trained on both the human-labeled data and the data labeled by the intermediate model.)

Negative. We may do away with the labels. One can use self-supervised pretraining on all the data available to learn meaningful representations and then learn the actual classification task. It is shown that this approach improves performance.
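
As a concrete illustration of the self-training route described above, here is a minimal pseudo-labeling sketch (my own, with an assumed confidence threshold; the paper's exact recipe may differ):

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold: float = 0.9, device: str = "cpu"):
    """Hedged sketch of the self-training step, not the paper's recipe: label the
    unlabeled pool with an intermediate model and keep only confident predictions;
    the final model is then trained on human labels plus these pseudo labels."""
    model.eval()
    kept_x, kept_y = [], []
    for x in unlabeled_loader:                   # loader assumed to yield plain input tensors
        probs = torch.softmax(model(x.to(device)), dim=1)
        conf, pred = probs.max(dim=1)
        mask = (conf > threshold).cpu()          # keep only confident pseudo labels
        kept_x.append(x[mask])
        kept_y.append(pred.cpu()[mask])
    return torch.cat(kept_x), torch.cat(kept_y)
```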

Takeaway: If you have class-imbalanced labels and more unlabeled data, do self-training or self-supervised pretraining. (It is shown that self-training beats self-supervised learning on CIFAR-10-LT though).

Big Bird: Transformers for Longer Sequences

https://neurips.cc/virtual/2020/public/poster_c8512d142a2d849725f31a9a7a361ab9.html

Self-attention in standard Transformers has quadratic complexity (in memory and computation) with respect to sequence length, so training on longer sequences is not feasible.

Enter Big Bird. It uses sparse attention where a particular position only attends to a few randomly selected tokens and some neighboring tokens.

That’s not what makes it work, though. Big Bird has multiple CLS tokens that attend to the entire sequence, and a token at any position attends to these CLS tokens, which give it relevant context, dependencies, and who knows what else self-attention layers learn.

Different types of attention in sparse attention (a) Random attention (b) Window neighborhood attention (c) Global attention on added CLS tokens. (The image is reproduced from the pdf of the current paper.)
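
The sparsity pattern itself is easy to sketch as a boolean attention mask; the following is an illustration only (the window, random, and global sizes are assumptions, and the real implementation works on blocks for efficiency):

```python
import torch

def bigbird_style_mask(seq_len: int, window: int = 3, n_random: int = 2,
                       n_global: int = 2) -> torch.Tensor:
    """Hedged sketch of a Big Bird-style boolean attention mask (illustrative only;
    the real implementation uses blocked sparse computation): each position attends
    to a local window, a few random positions, and a handful of global CLS-like
    tokens that attend to, and are attended by, everything."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                                    # sliding-window neighbours
        mask[i, torch.randint(0, seq_len, (n_random,))] = True   # random tokens
    mask[:, :n_global] = True                                    # everyone attends to the global tokens
    mask[:n_global, :] = True                                    # the global tokens attend to everyone
    return mask
```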

“Big Bird’s sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, Big Bird drastically improves performance on various NLP tasks such as question answering, summarization, and novel applications to genomics data.”

Takeaway: If you are working with longer sentences or sequences, as in summarization or applications to genomic data, use Big Bird for feasible training and respectable inference times. Even with shorter sequences, use Big Bird. I will take linear-complexity self-attention over quadratic any day!

Improving Auto-Augment via Augmentation-Wise Weight Sharing

https://neurips.cc/virtual/2020/public/poster_dc49dfebb0b00fd44aeff5c60cc1f825.html

Choosing a sequence of transformations and their magnitude for data augmentation for a particular task is domain-specific and time-consuming.

Auto-Augment is a technique to learn an optimal sequence of transformations, where the reward is the negated validation loss. Usually, RL is used to learn this policy. One iteration of learning this optimal policy involves fully training a model, which makes it a very expensive process.

So, the current work tries to make this process more efficient. It is based on the previously shown insight that, when training with a sequence of transformations, the effect of the augmentations is only prominent at the later stage of training.

In this work, for each iteration of evaluating a particular policy (a sequence of transformations), most of the training is done with a shared policy, and only the last part of the training is done with the candidate policy being evaluated. This is called Augmentation-Wise Weight Sharing.

Since training with the shared policy is done only once for all iterations, this method is much more efficient at learning an optimal policy.

Two stages of training the model when evaluating the given policy. (The image is reproduced from the pdf of the current paper.)
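
In code, evaluating one candidate policy then reduces to something like this hedged sketch (the function names are mine, and the shared checkpoint is assumed to be trained once, outside this routine):

```python
import copy

def evaluate_policy(shared_model, policy, finetune_fn, validate_fn) -> float:
    """Hedged sketch of augmentation-wise weight sharing (function names are mine):
    every candidate policy starts from the same checkpoint trained once with a
    shared policy, so only the short late-stage training is repeated per candidate."""
    model = copy.deepcopy(shared_model)   # reuse the shared early-stage weights
    finetune_fn(model, policy)            # brief late-stage training with this policy's augmentations
    return -validate_fn(model)            # reward for the RL controller = negated validation loss
```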

“On CIFAR-10, this method achieves a top-1 error rate of 1.24%, which is currently the best performing single model without extra training data. On ImageNet, this method gets a top-1 error rate of 20.36% for ResNet-50, which leads to a 3.34% absolute error rate reduction over the baseline augmentation.”

Takeaway: When you have the resources to search for an optimal sequence of data augmentations to increase a model's performance, use this method to train the RL agent that learns the optimal policy; it is far more efficient and makes Auto-Augment feasible for large datasets.

Fast Transformers with Clustered Attention

https://neurips.cc/virtual/2020/public/poster_f6a8dd1c954c8506aadc764cc32b895e.html

Like Big Bird above, Fast Transformers approximates standard self-attention to bring its quadratic dependency on sequence length down to linear.

To do this, instead of computing attention all-to-all (O(sequence_length*sequence_length)), the queries are clustered and attention values are computed only for the centroids. All the queries in a particular cluster then receive the same attention values, which makes the overall computation of self-attention linear with respect to sequence length: O(num_clusters*sequence_length).
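
Here is a hedged single-head sketch of the basic idea (my own, not the authors' implementation, and without the top-k refinement described next):

```python
import torch
from sklearn.cluster import KMeans

def clustered_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        n_clusters: int = 25) -> torch.Tensor:
    """Hedged single-head sketch, not the authors' implementation: cluster the
    queries, attend with the centroids only, and broadcast each centroid's output
    to the queries in its cluster. q, k, v are assumed to have shape (seq_len, dim)."""
    n_clusters = min(n_clusters, q.shape[0])
    assign = torch.as_tensor(
        KMeans(n_clusters=n_clusters, n_init=10).fit_predict(q.detach().cpu().numpy()))
    centroids = torch.stack([q[assign == c].mean(dim=0) for c in range(n_clusters)])
    attn = torch.softmax(centroids @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (C, L) instead of (L, L)
    out_per_cluster = attn @ v                                          # one output per cluster
    return out_per_cluster[assign]                                      # broadcast to every query
```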

To improve this approximation, handling the case where some keys have a large dot product with the centroid query but not with some of the cluster's member queries, the authors take the top-k keys the centroid attends to most and compute exact key-value attention between those top-k keys and all the queries in the cluster. This increases computation and memory but is still better than all-to-all.

“This paper shows that Fast Transformers can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pre-trained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.”

Takeaway: This is not as elegant as the Big Bird approach we saw above, but one has to try every option to bring the quadratic complexity of self-attention down to linear.

Limits to Depth Efficiencies of Self-Attention

https://neurips.cc/virtual/2020/public/poster_ff4dfdf5904e920ce52b48c1cef97829.html

To scale transformers, it is empirically shown that increasing the width (dimension of internal representation) is as efficient as increasing the depth (number of self-attention layers).

Contrarily, and more concretely, this work establishes that we can scale Transformers in depth up to a ‘depth threshold’, which is the base-3 logarithm of the width. If the depth is below this threshold, increasing the depth is more efficient than increasing the width. This is termed depth efficiency.

And if the depth is above this threshold, increasing the depth hurts compared to increasing the width. This is termed depth inefficiency.
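
The rule of thumb is easy to compute (a tiny illustration; the function name is mine):

```python
from math import log

def depth_threshold(width: int) -> float:
    """The paper's rule of thumb as stated above: the base-3 logarithm of the width."""
    return log(width, 3)

print(round(depth_threshold(768), 2))    # ~6.05 for an internal representation of width 768
print(round(depth_threshold(12288), 2))  # ~8.57 for a much wider model
```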

The number of parameters is directly proportional to the width of the network when the number of layers is held constant. The figure shows that depth is more useful when the network has sufficient width, i.e., in the depth-efficiency phase. (The image is reproduced from the pdf of the current paper.)

“By identifying network width as a limiting factor, our analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.”

Takeaway: When you want to scale the Transformer architecture for the next big language model, keep in mind that if the width is not large enough, increasing the depth doesn’t help. The depth should stay below the ‘depth threshold’, which is the base-3 logarithm of the width. So increase the width before increasing the depth to scale your Transformers to almost insane depths.
