What is the technology behind DeepScale?

Tesla’s latest acquisition has the best tech for designing DNNs

Giuliano Giacaglia
6 min read · Oct 13, 2019


The problem

Deep learning is taking over the world, but the most performant models keep getting bigger and bigger. These big networks require lots of resources: not only computing power, but also a great deal of energy and time to produce the results we want.

For computation at the edge that still produces great results, there is a need for smaller networks. The smaller the network, the faster it produces results and the fewer resources (such as GPUs) it needs.

A few techniques have been developed to get strong results out of smaller networks. Some of them are described in a blog post by Hugging Face, which explains how they distilled the latest versions of BERT, the best-performing neural network for NLP tasks at the time.

DeepScale uses another technique to find neural networks that are small and still produce great results: Neural Architecture Search. Neural Architecture Search promises to dramatically reduce both the engineering hours and the computing hours required to develop an entirely new neural network optimized for specific computing requirements and error rates (or accuracy requirements).

Solutions

  • The problem is defined as finding the networks that offer the best tradeoff between compute and accuracy within some search space
  • You can vary the number of layers, channels, kernel sizes, connections, etc.

If you tried to train every single neural network in that search space, the search would quickly become intractable: with even a few options per layer, the number of possible networks explodes combinatorially.
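
To make "intractable" concrete, here is a quick back-of-the-envelope count using the search-space size DeepScale reports later in this post (13 candidate blocks per layer across 22 layers); the one-GPU-hour-per-candidate training cost is purely an illustrative assumption.

    # Rough size of a SqueezeNAS-style search space: 13 candidate blocks
    # per layer, 22 searchable layers (numbers quoted later in this post).
    candidates_per_layer = 13
    num_layers = 22

    num_networks = candidates_per_layer ** num_layers
    print(f"{num_networks:.1e} possible networks")      # ~3.2e+24

    # Even at an assumed 1 GPU-hour to train each candidate,
    # exhaustive search is hopeless:
    gpu_years = num_networks * 1 / (24 * 365)
    print(f"{gpu_years:.1e} GPU-years of training")     # ~3.7e+20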

To steer toward the right architecture without exhaustive search, researchers have optimized the search process with a few different strategies:

  • Random Search
  • Genetic Search
  • Reinforcement Learning
  • Differential Search (Gradient Based)

In this post, we are going to go over the last two search methods, Reinforcement Learning and Differential Search, since they currently produce the best results with a practical amount of compute.

Reinforcement Learning

One of the first pieces of work that started the field was a paper from Google Brain called Neural Architecture Search with Reinforcement Learning.

[Figure: the controller (left) that updates the network parameters]

The idea is to use a reinforcement learning loop to promote networks that have the best accuracy while using the least amount of compute. Candidate networks are generated by a controller, which is an RNN-based network.

The controller generates a network by varying the number of filters, filter height and width, stride, and so on. The process works by generating a neural network, training it, measuring its accuracy, and feeding that accuracy back to the controller so it can generate better candidates.
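
Here is a toy sketch of that loop (not the paper's actual code): the controller is a tiny RNN, and the expensive "train the child network and measure its accuracy" step is replaced by a synthetic reward function so the example stays self-contained.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_LAYERS = 4
    CHOICES = [16, 32, 64, 128]   # e.g. filter counts per layer (illustrative)

    class Controller(nn.Module):
        """Tiny RNN controller that emits one categorical decision per layer."""
        def __init__(self, hidden=32):
            super().__init__()
            self.rnn = nn.GRUCell(len(CHOICES), hidden)
            self.head = nn.Linear(hidden, len(CHOICES))

        def sample(self):
            h = torch.zeros(1, self.rnn.hidden_size)
            inp = torch.zeros(1, len(CHOICES))
            choices, log_probs = [], []
            for _ in range(NUM_LAYERS):
                h = self.rnn(inp, h)
                dist = torch.distributions.Categorical(logits=self.head(h))
                idx = dist.sample()
                choices.append(CHOICES[idx.item()])
                log_probs.append(dist.log_prob(idx))
                inp = F.one_hot(idx, len(CHOICES)).float()
            return choices, torch.stack(log_probs).sum()

    def reward_for(arch):
        # Stand-in for "train the child network, measure accuracy, subtract a
        # compute penalty"; this toy version just prefers mid-sized layers.
        return 1.0 - sum(abs(c - 64) for c in arch) / 256.0

    controller = Controller()
    opt = torch.optim.Adam(controller.parameters(), lr=0.01)
    for step in range(200):
        arch, log_prob = controller.sample()
        loss = -log_prob * reward_for(arch)   # REINFORCE policy gradient
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(controller.sample()[0])   # architecture the controller now prefers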

Using this technique, the Google team achieved marginally better results than the state of the art (SOTA) at the time, but the compute it used was enormous: 800 Nvidia K40 GPUs for 28 days, or roughly half a million GPU-hours. The search was also performed on a small dataset, so it is hard to know how it would scale to a bigger one.

Constraining the search

Instead of searching over all possible networks, the same group decided to constrain the search to individual cells: they fixed the overall network topology and searched only for the structure of each cell. The results were much better. The resulting network had 20% better accuracy and was 28% faster on ImageNet1000.

The search also used far fewer resources, around 50k GPU-hours. In this setting, all the cells are the same. Still, this was too much compute to be practical.

Differential Search

The best results in terms of both accuracy and compute cost were found by a group at Facebook that used Differential Search to find the right architecture. The idea behind their trick was to use one very big network that contains many smaller candidate paths inside it, rather than having a controller generate small experiments and learn from them.

The group used a gradient-based stochastic SuperNet that jointly optimizes the convolutional weights of the network and the architecture parameters that choose among the individual units. It uses the Gumbel-Softmax to sample from a categorical distribution weighted by the learned architecture parameters. The Gumbel-Softmax trick is a way to train a neural network through a discrete set of choices while still being able to perform back-propagation.
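
To make the idea less abstract, here is a minimal PyTorch sketch (my own illustration, not Facebook's or DeepScale's code) of a supernet layer that mixes candidate blocks with Gumbel-Softmax weights, so gradients flow to both the convolution weights and the architecture logits.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixedOp(nn.Module):
        """One supernet layer holding several candidate blocks."""
        def __init__(self, candidates):
            super().__init__()
            self.candidates = nn.ModuleList(candidates)
            # one learnable architecture logit per candidate block
            self.arch_logits = nn.Parameter(torch.zeros(len(candidates)))

        def forward(self, x, tau=1.0):
            # Sample soft, near-one-hot weights from the categorical
            # distribution defined by the architecture logits. The sample is
            # differentiable w.r.t. the logits, so back-propagation can
            # update which block the layer prefers.
            weights = F.gumbel_softmax(self.arch_logits, tau=tau)
            return sum(w * op(x) for w, op in zip(weights, self.candidates))

    # Example: one layer choosing between a 3x3 and a 5x5 convolution.
    layer = MixedOp([
        nn.Conv2d(16, 16, kernel_size=3, padding=1),
        nn.Conv2d(16, 16, kernel_size=5, padding=2),
    ])
    out = layer(torch.randn(1, 16, 32, 32))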

The Facebook team trained a supernetwork to find both the network weights and the architecture parameters, running the search on a roughly 10% subset of ImageNet (the 1000-class classification dataset).

The results were impressive: they matched MobileNetV2's accuracy on ImageNet classification with 1.5x lower latency, and the search cost was only 216 GPU-hours (on P100s).

DeepScale solution

DeepScale took a similar approach, applying NAS (Neural Architecture Search) to design DNNs for semantic segmentation. Semantic segmentation is used widely by self-driving car companies.

Image segmentation is much harder than image classification and requires far more computing resources. For image classification, state-of-the-art neural networks need around 10 GFLOPs of compute, while SOTA networks for image segmentation need around 1 TFLOP.

DeepScale used the same technique as FBNet to create a neural network for image segmentation that achieves SOTA accuracy while being much more efficient. They performed the neural architecture search over the encoder of the network.

Here are the details of DeepScale's SqueezeNAS implementation:

  • 13 candidate blocks per unit across 22 layers (about 10²⁴ possible networks)
  • Randomly initialize the SuperNetwork and train only the convolution weights at first, then switch to alternating between training the convolution weights and the architecture parameters.

After the SuperNetwork has converged (a rough code sketch of this stage follows the list):

  • Sample candidate networks via the Gumbel-Softmax trick
  • Run each sampled candidate network on the validation set of Cityscapes
  • Put the best candidates through the full training regime
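
That post-convergence stage might look roughly like the following (again my own illustration, with placeholder logits and a placeholder validation function standing in for mIoU on the Cityscapes validation set):

    import torch
    import torch.nn.functional as F

    NUM_LAYERS, NUM_CANDIDATES = 22, 13
    # Assume these logits were learned during SuperNetwork training.
    arch_logits = torch.randn(NUM_LAYERS, NUM_CANDIDATES)

    def sample_architecture(logits):
        # hard=True gives one-hot rows, i.e. one concrete block choice per layer
        one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        return one_hot.argmax(dim=-1).tolist()   # e.g. [3, 0, 7, ...]

    def validation_score(arch):
        # Placeholder for "assemble this candidate and measure mIoU on the
        # Cityscapes validation set"; random here so the sketch runs.
        return torch.rand(()).item()

    # Sample a pool of candidates, score them, and keep the best few
    # for the full training regime.
    pool = [sample_architecture(arch_logits) for _ in range(20)]
    finalists = sorted(pool, key=validation_score, reverse=True)[:3]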

After all of that, DeepScale chose the best network given its accuracy and compute cost. The networks they found deliver the best performance while requiring less compute: for example, their SqueezeNAS-MAC-Large network achieved more than 2.5% higher absolute mIoU than the MobileNetV2-based segmentation network, which has more than double the MACs of their network.

Each of DeepScale's searches completed in less than 15 GPU-days, more than 100 times less compute than some reinforcement learning and genetic search methods.

Conclusion

All in all, neural architecture search seems to yield the best results when you want networks that achieve SOTA accuracy while requiring few compute resources. We will probably see more and more innovation in this space, resulting in better networks.
