
Intuitive Explanation of Differentiable Architecture Search (DARTS)

Understanding how DARTS works!

DARTS is a paper that came out in mid-2018 and addresses the scalability problem of searching for a network architecture. Papers like this one tackle the problem of Neural Architecture Search, or NAS for short.

As the name suggests, the idea behind this field is to explore how we can automatically search for deep learning model architectures. Currently, most data science problems are solved by manually designing a model architecture that gives “state of the art” results on a given dataset. The problem with this approach is that, although these architectures perform really well on standard datasets, they often don’t perform as expected on organisation-specific datasets.

Top Left: Unet-Architecture | Top Right: Original Image published in [LeCun et al., 1998] | Bottom left: VGG16 Architecture | Bottom Right: ResNet architecture

This article is for those who are stepping into NAS research or reading this wonderful paper. I worked in this field as an internship project at the Indian Space Research Organisation (ISRO). In this blog, I will try to explain the paper in an intuitive manner, as I experienced a lot of difficulties when I implemented it for semantic segmentation.

Introduction to Neural Architecture Search (NAS)

The problem of neural architecture search is posed as follows.

Given a search space of candidate operations O, we need to find the combination of these operations that maximizes or minimizes the objective function.

Dog Image cc-by: Von.grzanka | Image by Author | Animation showing how different operations affect output.

In simple words, we need to find the architecture of the model to minimize the loss.

Naïve Solution

A naive solution to NAS is trial and error. We randomly select a subset of operations, evaluate each resulting model on a metric such as validation loss, and keep the configuration with the best performance.
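To make this concrete, here is a minimal sketch of that trial-and-error baseline. The helpers `build_model` and `evaluate` are hypothetical placeholders for training a configuration from scratch and measuring its validation loss:

```python
import random

# Toy candidate operations; a real search space is much larger.
CANDIDATE_OPS = ["conv_3x3", "max_pool_3x3", "dilated_conv_5x5", "skip_connect"]

def random_search(num_trials, num_layers, build_model, evaluate):
    """Naive NAS: sample architectures at random and keep the best one."""
    best_arch, best_val_loss = None, float("inf")
    for _ in range(num_trials):
        # Randomly pick one operation per layer to define an architecture.
        arch = [random.choice(CANDIDATE_OPS) for _ in range(num_layers)]
        model = build_model(arch)      # train this configuration from scratch
        val_loss = evaluate(model)     # measure validation loss
        if val_loss < best_val_loss:
            best_arch, best_val_loss = arch, val_loss
    return best_arch, best_val_loss
```

Every trial requires a full training run, which is exactly why this approach does not scale.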

Brief History of progress in NAS

We will not go into depth here, but several influential papers paved the way for NAS research, moving from reinforcement-learning and evolution based search towards differentiable methods.

More recently, HNAS: Hierarchical Neural Architecture Search on Mobile Devices extended the idea of DARTS to the next level.

The trend in the research has been to decrease the computation time from 2000 GPU days of reinforcement learning or 3150 GPU days of evolution to 2–3 GPU days for DARTS.

Approach for NAS

Searching for a high-performing model architecture is not trivial and involves two steps.

  1. Searching the cell architecture on a small dataset (e.g. CIFAR10 or CIFAR100)
  2. Building a model from the searched cell architecture and training it on a large dataset (e.g. ImageNet)

Searching Cell Architectures

Image by Author | Structure of a simple cell and its mixed operations. The cell shown above has 3 states in a stacked fashion.

What is a cell in a model anyway? A cell can be considered a special block in which layers are stacked, just like in any other model. Cells apply several convolution operations to produce feature maps, which are then passed on to other cells. A complete model is built by stacking these cells in series. All these papers follow a common pattern in which two types of cell structure are searched: a Normal Cell and a Reduction Cell.

Image by Author | Normal Cell

Normal Cell: A Normal Cell can be thought of as an ordinary block that computes feature maps without changing their spatial dimensions. Convolutions and poolings in this block have a stride of 1.

Image by Author | Reduction Cell

Reduction Cell: A Reduction Cell can be thought of as a block that reduces the feature map dimensions. Convolutions and poolings in this block have a stride of 2, so its purpose is to downsample the feature maps.

Since all these papers tackle the classification problem, a global average pooling layer is used at the end, along with optional fully connected layers.

Image by Author | Normal and Reduction cells stacked to form the final model after the search phase.
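As a rough sketch of this stacking pattern (not the official implementation), assuming hypothetical `make_normal_cell` and `make_reduction_cell` factories that return the searched cells, the reduction cells are typically placed at roughly one third and two thirds of the network depth:

```python
import torch.nn as nn

def build_network(num_cells, make_normal_cell, make_reduction_cell, channels, num_classes):
    """Sketch: stack searched cells into a full classifier.

    Simplified: real DARTS cells take the outputs of the *two* previous cells
    as input and double the channels at each reduction cell; this sketch
    ignores those details and assumes every cell maps `channels` -> `channels`.
    """
    cells = []
    for i in range(num_cells):
        # Reduction cells at roughly 1/3 and 2/3 of the depth.
        if i in (num_cells // 3, 2 * num_cells // 3):
            cells.append(make_reduction_cell())
        else:
            cells.append(make_normal_cell())
    return nn.Sequential(
        *cells,
        nn.AdaptiveAvgPool2d(1),   # global average pooling for classification
        nn.Flatten(),
        nn.Linear(channels, num_classes),
    )
```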

Details about DARTS: How is it better?

DARTS is a very influential paper in neural architecture search. Earlier methods used reinforcement learning or evolution and required a huge amount of computational resources: 2000 GPU days of reinforcement learning or 3150 GPU days of evolution. This computation time is simply not feasible for most organizations.

“In this work, we approach the problem from a different angle and propose a method for efficient architecture search called DARTS (Differentiable Architecture Search). Instead of searching over a discrete set of candidate architectures, we relax the search space to be continuous, so that the architecture can be optimized with respect to its validation set performance by gradient descent.

The data efficiency of gradient-based optimization, as opposed to inefficient black-box search, allows DARTS to achieve competitive performance with the state of the art using orders of magnitude less computation resources.

We introduce a novel algorithm for differentiable network architecture search based on bilevel optimization, which is applicable to both convolutional and recurrent architectures.” — source: DARTS Paper

DARTS reduced the search time to 2–3 GPU days which is phenomenal.

How does DARTS do this?

  1. Searching over a discrete set of candidate operations is computationally heavy.

The problem with searching over a discrete set of candidate operations is that a model has to be trained on one specific configuration before moving on to the next. This is obviously time-consuming. The authors found a way of relaxing the discrete set of candidate operations into a continuous one.

“To make the search space continuous, we relax the categorical choice of a particular operation to a softmax over all possible operations:” — DARTS paper

Equation from DARTS paper

What this means is the following: assume we have a few candidate operations, namely

O = {conv_3x3, max_pool_3x3, dilated_conv_5x5}.

The output of the edge is called a mixed operation and is obtained by multiplying the output of each candidate operation by its softmax probability and summing the results.

Image by Author | Image showing how Mixed Operation is computed.
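A minimal PyTorch sketch of this mixed operation could look like the following; the three candidate ops mirror the toy set O above, and the channel count C is just a placeholder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def candidate_ops(C):
    """Toy candidate operations, all preserving spatial size and channels."""
    return nn.ModuleList([
        nn.Conv2d(C, C, 3, padding=1, bias=False),               # conv_3x3
        nn.MaxPool2d(3, stride=1, padding=1),                     # max_pool_3x3
        nn.Conv2d(C, C, 5, padding=4, dilation=2, bias=False),    # dilated_conv_5x5
    ])

class MixedOp(nn.Module):
    """Weighted sum of all candidate ops; weights are a softmax over alphas."""
    def __init__(self, C):
        super().__init__()
        self.ops = candidate_ops(C)

    def forward(self, x, alphas):
        # alphas: 1-D tensor with one architecture parameter per candidate op.
        weights = F.softmax(alphas, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

During the search, the alphas would be `nn.Parameter` tensors (one per edge), learned by gradient descent alongside the convolution weights.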

“Each intermediate node is computed based on all of its predecessors.” — DARTS paper

Equation from DARTS paper
Image by Author | Typical NAS Cell | Note how each node has output from all the previous nodes as its input.

This brings us to the structure of the cell in DARTS. This is the core structure of the model, so I want you to pay close attention here.

The cell contains one or more nodes. These nodes are also known as states.

The input to a cell is the outputs of the last two cells, just like in ResNets. Inside the cell there are nodes. Let us assume we build a cell with 3 states/nodes. The first node will have two inputs, i.e. the outputs from the last two cells.

The second state will have inputs from the first state and the outputs from the last two cells, so 3 inputs in total.

The third state will have inputs from the second state, the first state, and the outputs from the last two cells.
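Continuing the sketch above (reusing `MixedOp` and the same imports), a simplified cell where every node sums a mixed operation applied to each of its predecessors might look like this:

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """Simplified DARTS-style cell with `num_nodes` intermediate nodes.

    Inputs are the outputs of the previous two cells; every node receives an
    edge (a MixedOp) from both inputs and from every earlier node. The official
    implementation also preprocesses the two inputs, which this sketch omits.
    """
    def __init__(self, C, num_nodes=3):
        super().__init__()
        self.num_nodes = num_nodes
        self.edges = nn.ModuleList()
        for node in range(num_nodes):
            for _ in range(2 + node):        # two cell inputs + earlier nodes
                self.edges.append(MixedOp(C))

    def forward(self, s0, s1, alphas):
        # alphas: one row per edge, one column per candidate op.
        states, offset = [s0, s1], 0
        for node in range(self.num_nodes):
            new_state = sum(
                self.edges[offset + i](h, alphas[offset + i])
                for i, h in enumerate(states)
            )
            offset += len(states)
            states.append(new_state)
        # The cell output concatenates the intermediate nodes along channels.
        return torch.cat(states[2:], dim=1)
```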

At the end of the search, a discrete architecture can be obtained by replacing each mixed operation o(i,j) with the most likely operation, i.e.

Equation from DARTS paper.
Image from DARTS Paper

This sounds complex, but let’s break it down. During the search, the discrete search space was relaxed into a continuous one on which gradient descent works fine. Once the search phase is over, we recover the discrete architecture of a cell by keeping, for each node, the top-k (generally k = 2) strongest incoming connections, each with its most likely operation.
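Here is a sketch of that discretization, using the edge ordering from the `Cell` sketch above: for each node we keep the k = 2 strongest incoming edges, and on each kept edge the operation with the highest softmax weight (the paper additionally excludes the zero operation, which this toy version ignores):

```python
import torch

def derive_cell(alphas, num_nodes=3,
                op_names=("conv_3x3", "max_pool_3x3", "dilated_conv_5x5"), k=2):
    """Turn continuous alphas back into a discrete cell description.

    alphas: tensor of shape (num_edges, num_ops), ordered edge-by-edge as in Cell.
    Returns a list of (node, predecessor, op_name) triples.
    """
    genotype, offset = [], 0
    for node in range(num_nodes):
        num_inputs = 2 + node
        edge_alphas = torch.softmax(alphas[offset:offset + num_inputs], dim=-1)
        best_op = edge_alphas.argmax(dim=-1)            # best op per incoming edge
        strength = edge_alphas.max(dim=-1).values       # edge strength = its best weight
        # Keep the top-k strongest incoming edges for this node.
        for pred in torch.topk(strength, k=min(k, num_inputs)).indices.tolist():
            genotype.append((node, pred, op_names[int(best_op[pred])]))
        offset += num_inputs
    return genotype
```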

2. Bilevel Optimization

“After relaxation, our goal is to jointly learn the architecture α and the weights w within all the mixed operations (e.g. weights of the convolution filters).” — DARTS Paper

We have discussed how we can obtain the searched architecture, but how the model searches for the optimal operations is still an open question. The training part remains.

The optimization problem can be posed as finding the alphas that minimize the validation loss, given weights that have already been optimized on the training set for those alphas.

Note how the alphas are optimized on the validation set while the layer weights are optimized on the training set. This is called bilevel optimization.
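Written out in the paper’s notation, the bilevel problem is:

```latex
\min_{\alpha} \;\; \mathcal{L}_{val}\big(w^{*}(\alpha),\, \alpha\big)
\qquad \text{s.t.} \qquad
w^{*}(\alpha) = \operatorname*{arg\,min}_{w} \; \mathcal{L}_{train}(w, \alpha)
```

The alphas are the upper-level variables optimized on the validation set, while the weights w are the lower-level variables optimized on the training set.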

Approximate Architecture Gradient — The elephant in the room

“Evaluating the architecture gradient exactly can be prohibitive due to the expensive inner optimization. We, therefore, propose a simple approximation scheme as follows:” DARTS Paper

Image by Author | Equation showing gradients of alphas
1st (Cat) Photo by Loan on Unsplash, Second (Dog) Photo by Victor Grabarczyk on Unsplash, Third (Dog) Photo by Alvan Nee on Unsplash | Image by Author | Notice how changing the alphas (orange lines) changes the training loss (top graph), and how the weights would have to be retrained until convergence. Optimizing over alphas needs optimized weights first.

There is a computational problem in this formulation. To get the optimal convolution weights, we need to train the network to convergence by minimizing the training loss. This means that every time alpha is updated, a full inner training run would be required, which makes the search infeasible.

“The idea is to approximate w*(α) by adapting w using only a single training step, without solving the inner optimization completely by training until convergence.”

Equation from DARTS paper

In equation 5, obtaining the optimal weights w* for each configuration of alpha leads to two nested optimization loops, so the authors suggest approximating w* in such a way that there is no need to optimize it until convergence. The idea is to use just one training step instead of the whole inner optimization loop.
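A rough sketch of the resulting alternating update, in its simpler first-order flavour (the second-order correction from equations 7 and 8 is discussed next). Here `model`, the two optimizers, the batches and the loss function are assumed to be set up elsewhere, with `w_optimizer` holding the network weights and `alpha_optimizer` holding the architecture parameters:

```python
def search_step(model, w_optimizer, alpha_optimizer, train_batch, val_batch, loss_fn):
    """One DARTS-style search iteration (first-order approximation).

    Instead of solving the inner problem to convergence, the weights get a
    single training step, then the alphas are updated on a validation batch.
    """
    # 1) Single inner step: update the weights w on the training batch.
    x_train, y_train = train_batch
    w_optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    w_optimizer.step()

    # 2) Outer step: update the architecture parameters alpha on the validation batch.
    x_val, y_val = val_batch
    alpha_optimizer.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    alpha_optimizer.step()
```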

Image from DARTS Paper


Looking at equation 7, we have a second-order partial derivative, which is computationally expensive to evaluate. To solve this, a finite difference approximation is used.
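Reconstructed from the paper (equation 8, up to notation), the expensive second-order term is replaced by two extra gradient evaluations around slightly perturbed weights:

```latex
\nabla^{2}_{\alpha, w}\, \mathcal{L}_{train}(w, \alpha)\; \nabla_{w'} \mathcal{L}_{val}(w', \alpha)
\;\approx\;
\frac{\nabla_{\alpha} \mathcal{L}_{train}(w^{+}, \alpha) \;-\; \nabla_{\alpha} \mathcal{L}_{train}(w^{-}, \alpha)}{2\epsilon},
\qquad
w^{\pm} = w \pm \epsilon\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha)
```

Evaluating this requires only two extra forward-backward passes over the training data instead of an explicit second-order derivative.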

Look at equation 8: there is no second-order partial derivative!

For results, you can refer to the DARTS paper.

Alternative Optimization Strategies

The authors also tried optimizing the alphas and weights jointly on the combined training + validation data, but the results deteriorated. They explained that this could be due to the alphas overfitting the data.

Conclusion

DARTS was a very influential paper that drastically reduced the time needed to search for a high-performing architecture from thousands of GPU days to just 2–3 GPU days, while still achieving state-of-the-art results.

