How to Create More Efficient Deep Learning Models

An overview of approaches and techniques

Photo by Thomas Kelley on Unsplash

In this article, I am going to present and discuss several important approaches and techniques that can help improve the efficiency of a Deep Learning model on different levels. These kinds of optimizations are becoming even more important now, as the latest improvements in deep learning models also bring an increase in the number of parameters, resource requirements for training, latency, storage requirements, etc.

The main topics that I will tackle are the following:

  • Compression Techniques
  • Learning Techniques
  • Efficient Architectures
  • Automation

Compression Techniques

These techniques target the representational efficiency of the entire model. This is possible mainly because many state-of-the-art models are over-parameterized. Multiple aspects of the model, such as training time, inference latency, and memory footprint, can be improved without affecting (within a margin) the scoring results computed on the original version. By compressing some part of the computation graph of a neural network model we can also improve its generalizability.

Multiple ideas have been explored here, some of the most successful are:

  • Pruning – refers to removing, or setting to zero, a subset of a neural network's parameters, using various strategies to pick the affected weights, in order to obtain a sparse network that does not need as much memory as before. The most popular pruning strategies are based on: saliency, random structured/unstructured pruning, scheduling, distribution of the sparsity budget, and regrowth. After pruning, it is usually recommended to fine-tune the resulting model.

Example of model pruning in PyTorch:

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small example model with the layers referenced below
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3)
        self.fc1 = nn.Linear(16, 10)

model = Model()
layers_to_prune = (
    (model.conv1, 'weight'),
    (model.fc1, 'weight')
)
# This will prune the 20% of the selected parameters with the lowest L1-norm
prune.global_unstructured(
    layers_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)
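
Note that PyTorch implements pruning as a re-parameterization (a mask is stored alongside the original weight tensor). Once we are happy with the sparse model, we can make the pruning permanent:

# Make the pruning permanent by removing the re-parameterization buffers
for module, name in layers_to_prune:
    prune.remove(module, name)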
  • Quantization – reduces the precision of the data types used for the weights and activations of the model (for example, from 32-bit floating-point values to 8-bit fixed-point). Most of the time we see improvements in both memory footprint and latency when we apply quantization. Generally, there are two types of quantization: post-training quantization and quantization-aware training. The terms are mostly self-explanatory (at least at a high level); the only thing that needs to be mentioned is that the first one can affect the quality of the model at inference.

PyTorch has multiple quantization strategies; here is the simplest one to use, dynamic quantization:

quantized_model = torch.quantization.quantize_dynamic(
    model,
    qconfig_spec={torch.nn.Linear},
    dtype=torch.qint8
)
# qconfig_spec specifies the set of layer types (or submodule names) in model to apply quantization to
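
As a quick sanity check, one way to see the effect on the memory footprint is to compare the serialized size of the two models; model_size_mb below is just an illustrative helper, not part of the PyTorch API:

import os
import torch

def model_size_mb(m, path="tmp_weights.pt"):
    # Serialize the state dict and measure its size on disk
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32 model: {model_size_mb(model):.2f} MB")
print(f"int8 model: {model_size_mb(quantized_model):.2f} MB")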
  • Matrix-based Compression Techniques: Low-Rank Approximation, Dictionary Learning, Layer Splicing, etc.
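
As an example of the low-rank idea, here is a minimal sketch that approximates a (randomly generated, stand-in) weight matrix with two much smaller factors via a truncated SVD; the target rank is arbitrary:

import torch

W = torch.randn(512, 512)                       # stand-in for a layer's weight matrix
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

rank = 64                                       # illustrative target rank
A = U[:, :rank] * S[:rank]                      # shape (512, 64)
B = Vh[:rank, :]                                # shape (64, 512)
W_approx = A @ B                                # 512*64 + 64*512 parameters instead of 512*512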

Learning Techniques

These techniques try to enhance the quality of the model by changing some aspects of the training procedure. Because they only target the training phase, the validation/test scores remain representative of what to expect in production.

Types of learning techniques:

  • Distillation – this method introduces the idea of a "student" network and a "teacher" network. Basically, we have one larger network, or an ensemble of larger networks, that "teaches" the smaller network to reproduce its outputs or some intermediate representation. We can also use the teacher network to create soft labels that we use in the loss function alongside the ground-truth labels; the idea is that the soft labels might capture some relationships between classes, which can help during training (as sketched below).
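
A minimal sketch of such a combined loss in PyTorch is shown below; the temperature T and the mixing weight alpha are illustrative hyperparameters, not values from a specific recipe:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the softened teacher distribution (scaled by T^2, as is customary)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy with the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard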
  • Data Augmentation – when working with deep models we usually need a large number of samples to make sure that our model can generalize. However, this can sometimes be a problem due to the high cost or scarcity of that specific data type. A possible remedy for the dataset size problem is data augmentation, which is basically a set of methods for generating synthetic samples by applying transformations or interpolation to the existing ones. Most data augmentation techniques target computer vision tasks; some examples are: resize, rotation, flip, crop, etc.

Using the Albumentations library, we can stack multiple types of transformations and use them directly in our custom PyTorch Dataset class:

import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transforms = A.Compose(
    [
        A.Resize(width=320, height=320),
        # The crop size must be smaller than the resized image
        A.RandomCrop(height=280, width=280),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomRotate90(p=0.5),
        ToTensorV2(),
    ]
)
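
As a quick usage sketch (the random array below is just a stand-in for a real image loaded inside __getitem__):

import numpy as np

image = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)  # placeholder image
augmented = train_transforms(image=image)["image"]                     # a torch.Tensor after ToTensorV2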
  • Self-Supervised Learning – represents one of the most interesting solutions to the "old" problem of not having a large enough labeled dataset. When applying a Self-Supervised Learning method, we create a "pretext task" on our unlabeled data that allows us to learn good representations that can later be used for more specific tasks. Once we have good enough embeddings, we can add a prediction head and fine-tune the model with the labeled data. For example, in NLP it is rather common to apply self-supervision by predicting a masked word in an unlabeled sentence. In CV there is the concept of Contrastive Learning, where a model is trained to pull together the representations of augmented views of the same image and push apart the representations of different images. Due to the low requirements for labeled data, this technique is considered to be data-efficient.
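
Below is a much-simplified, illustrative contrastive (InfoNCE-style) loss; z1 and z2 are assumed to be the embeddings of two augmented views of the same batch of images:

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                      # cosine similarities between the two sets of views
    labels = torch.arange(z1.size(0), device=z1.device)   # matching views sit on the diagonal
    return F.cross_entropy(logits, labels)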

Efficient Architectures

Another way of improving the efficiency of a deep learning system is to take a step back and tackle the problem at the level of the model's architecture, since some neural network layer designs are better suited for specific tasks or data types.

Next, I will show several examples of architecture designs that brought significant improvements to computer vision and natural language processing:

  • Computer Vision: Convolutional Layers – this type of layer has revolutionized the field of computer vision. It takes advantage of the spatial locality of image features, and by stacking such layers it builds multiple levels of representation, allowing the detection of more complex features in the later layers. Furthermore, because the convolution operation reuses the same filter across the whole image, it also considerably reduces the number of parameters of the model (a minimal example follows below).
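
For illustration, a tiny stack of convolutional layers in PyTorch might look like this; the channel sizes are arbitrary:

import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # the same 3x3 filters are reused across the image
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # later layers detect more complex features
    nn.ReLU(),
    nn.MaxPool2d(2),
)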
  • Natural Language Processing: Transformers – have brought enormous improvements to the field of NLP (beginning with Attention Is All You Need, Ashish Vaswani et al.). The main advantage of this type of neural network is that it removes the bottleneck of having just a single feature vector represent the context of the entire input sequence: the Transformer architecture uses self-attention and cross-attention to encode a context for each input element of the sequence. Here is a great course that explains the basics and usage of Transformers.
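
And a minimal example of a Transformer encoder that produces one context-aware vector per input element (the sizes below are arbitrary):

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

tokens = torch.randn(4, 32, 128)   # (batch, sequence length, embedding dimension)
contextual = encoder(tokens)       # shape (4, 32, 128): one context vector per token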

Automation

Another approach to improving the efficiency of Machine Learning models is to "brute force" the search over different ideas using automatic search techniques. The big disadvantage is that search-based approaches require a large amount of computational resources and time.

We can split the types of automation by the scope of their search spaces:

  • Hyperparameter Optimization (HO) – as the name suggests, this type of automation tries to find a more efficient model by changing the values of hyperparameters such as the learning rate, number of layers, weight decay, batch size, etc. Even when using a KFold splitting strategy to evaluate the candidate hyperparameter values, going through all of them still requires a lot of computation. There are several search strategies that we can follow: Grid Search, Random Search, Bayesian Search, and Coarse-to-Fine Search (a Random Search sketch follows below).
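
Here is a sketch of what Random Search can look like; train_and_evaluate is a hypothetical helper (not defined here) that trains a model with the given configuration and returns a validation score, and the search space values are only illustrative:

import random

search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "num_layers": [2, 4, 6],
    "weight_decay": [0.0, 1e-4, 1e-2],
    "batch_size": [16, 32, 64],
}

best_score, best_config = float("-inf"), None
for _ in range(20):                                     # number of random trials
    config = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_evaluate(config)                  # hypothetical helper, not defined here
    if score > best_score:
        best_score, best_config = score, config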
  • Neural Architecture Search (NAS) – this can be considered an extension of hyperparameter optimization, allowing additional elements in the search space such as different operation blocks (convolutions, linear layers, pooling) and different ways of combining them. Furthermore, for NAS, researchers have also used Reinforcement Learning to search for better architectures.

As a final word, the list I have gone through in this article is by no means comprehensive, and many more ongoing research efforts try to improve every bottleneck of a deep learning system.

Thank you for reading and if you want to stay up to date with the latest machine learning news and some good quality memes :), you can follow me on Twitter here.

