[ Paper Summary ] Convolutional Networks with Adaptive Computation Graphs

Jae Duk Seo
Towards Data Science
8 min read · Jul 20, 2018



I am always intrigued by different types of network architectures, and this paper presents a very special one.

Please note that this post is for my future self to look back and review the materials on this paper without reading the paper all over again.

Paper from this website (see the references below).

Abstract

Neural networks usually have a fixed structure, but does it really have to be that way? After a certain layer the network might already be very confident that it is seeing a dog or a cat in the image, but due to the fixed structure it has to use all of the layers, and this might hurt its performance. In this paper, the authors present Adanets, a novel family of convolutional networks with adaptive computation graphs. They achieve better performance on ImageNet than ResNet 34 with fewer parameters and are more robust to adversarial examples.

Introduction

Convolutional neural networks are used not only for segmentation but also in a variety of other domains, and it is a well-known fact that deeper networks tend to perform better. One common property of these networks is that they are a fixed model, agnostic to the input image. However, it has been shown that some layers do not contribute that much to the network's performance, so the question to be asked is…

Do we really need a fixed structure for convolutional networks, or could we assemble a network graph on the fly, conditioned on the input?

This paper proposes Adanets, convolutional networks with adaptive computation graphs, and the general idea can be seen above. One challenge was that each gating unit needs to make a discrete decision on whether to use the next layer or not, and training such discrete decisions directly via back-propagation is hard. So the authors built upon recent work on differentiable approximations for discrete stochastic nodes. The result of this architecture is a network that is able to generate distinct computational graphs for different high-level categories.

Related Work

The authors' work is related to multiple fields: neural network composition (via constructing computational graphs), adaptive computation time for neural networks (via dynamic computation time), regularization with stochastic noise (via dropping certain layers), and attention mechanisms (via selecting specific layers of importance to assemble the computation graph).

Adanets / Adaptive Computation Graph

First, each convolutional layer can be expressed mathematically as below.

Second, a ResNet layer can be expressed as below.

And an adaptive ResNet layer can be expressed as below.
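Since the original equation images did not carry over, here is my reconstruction of the three formulas from the paper (the notation may differ slightly from the authors'):

```latex
% Standard convolutional layer: the output is the layer function applied to its input
x_l = \mathcal{F}_l(x_{l-1})

% ResNet layer: identity skip connection plus the residual function
x_l = x_{l-1} + \mathcal{F}_l(x_{l-1})

% Adaptive ResNet layer: a binary gate z decides whether the residual branch is executed
x_l = x_{l-1} + z(x_{l-1}) \cdot \mathcal{F}_l(x_{l-1}), \qquad z(x_{l-1}) \in \{0, 1\}
```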

One important note is that the above formula looks similar to a highway network (as seen below), but the adaptive graph network does not have to execute every layer.
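For comparison, the highway-network layer (again my reconstruction) uses a soft gate that blends the two branches, so both branches are always computed:

```latex
x_l = (1 - t(x_{l-1})) \cdot x_{l-1} + t(x_{l-1}) \cdot \mathcal{F}_l(x_{l-1})
```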

Now let's talk about the meat of this network, the gating unit. This unit is very important since it has to understand the input data, make a discrete decision, and be cheap to execute. So the authors proposed a gating unit with two components: the first component estimates the probability that the next layer should be executed, and the second component takes the estimated probability and draws a discrete sample from it. (The general idea can be seen in the figure above.)

As seen above, before being passed to the gating unit, the feature map is reduced via global average pooling over the spatial dimensions (so the dimension is now 1 × 1 × C).

Next, that feature map is fed into a small fully connected network to produce a 2-dimensional vector, whose entries represent the log-probabilities for computing and skipping the following layer, respectively. After that they use the Gumbel-Max trick, and in particular its recent continuous relaxation, to perform the discrete operation (whether to skip the layer or not). So the authors did a smart thing here. (To be clear about notation, the beta term above is set to alpha.)
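To make this concrete, here is a minimal sketch of what such a gating unit could look like in PyTorch; the class name, hidden size, and exact layer choices are my own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatingUnit(nn.Module):
    """Sketch of the first gate component: estimate log-probabilities for
    executing vs. skipping the next layer from a globally pooled feature map."""

    def __init__(self, channels, hidden=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, 2)  # [log p(execute), log p(skip)]

    def forward(self, x):
        # x: (N, C, H, W) -> global average pooling over H and W gives (N, C)
        pooled = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return self.fc2(F.relu(self.fc1(pooled)))  # (N, 2) logits
```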

G is a random variable that follows a Gumbel distribution, and during the feed-forward pass they use the (arg-max) function above. However, during back-propagation they use a softmax function, specifically a differentiable relaxation.

And note, as written above, depending on the temperature term τ the softmax function approaches the argmax function. It is important to note that the authors could have used equation 7 for both the feed-forward pass and back-propagation, but via experiments they found this configuration to be optimal.
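A minimal sketch of this straight-through Gumbel-Softmax idea, assuming the standard formulation rather than the authors' exact code: the hard arg-max sample is used in the forward pass, while gradients flow through the soft, temperature-controlled relaxation.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, temperature=1.0, hard=True):
    # Sample Gumbel noise G = -log(-log(U)), U ~ Uniform(0, 1)
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    # Continuous relaxation: softmax over (logits + noise) / temperature
    y_soft = F.softmax((logits + gumbel) / temperature, dim=-1)
    if not hard:
        return y_soft
    # Discrete forward sample: one-hot arg-max of the relaxed sample
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # Straight-through estimator: forward uses y_hard, backward uses grad of y_soft
    return (y_hard - y_soft).detach() + y_soft
```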

Training Adanets

Due to the novel architecture, the authors had to introduce an additional loss function. There were three things to be considered: 1) the network might learn to use all the layers, 2) some layers might die out, and 3) the effective batch size seen by each gated layer is reduced. The authors first used the traditional multi-class logistic loss function, and additionally they introduced the target rate loss function (seen below).

where z denotes the fraction of times layer l is executed within a minibatch, and t is the target rate. Combining these two losses, we obtain something like the expression below.
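As a rough sketch (with my own names and a hypothetical target rate value), the combined loss could look something like this: the usual cross-entropy term plus a squared penalty that pushes each layer's execution rate within the minibatch toward the target rate t.

```python
import torch
import torch.nn.functional as F

def adanet_loss(logits, labels, gate_decisions, target_rate=0.7):
    """Multi-class logistic loss plus target-rate loss (a sketch, not the
    authors' exact code). gate_decisions: list of (N,) 0/1 tensors, one per
    gated layer, indicating which samples executed that layer."""
    ce = F.cross_entropy(logits, labels)
    target = logits.new_zeros(())
    for z in gate_decisions:
        z_l = z.float().mean()  # fraction of the minibatch executing layer l
        target = target + (z_l - target_rate) ** 2
    return ce + target
```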

Additionally, for successful training, the network was initialized to be biased toward opening the gates, and the learning rates for the gating units were reduced.

Experiments

* denotes Adanets in which all layers are executed and their outputs are scaled by the expected execution rate

The authors created an Adanet from ResNet 110 as well as a wide Adanet, and they used a momentum optimizer with a weight decay of 5e-4, 350 epochs, and a mini-batch size of 256. As seen in the table above, Adanets are able to outperform the different ResNets while reducing computation time.
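For reference, the optimizer setup described above could be written roughly as below; only the momentum optimizer, weight decay of 5e-4, 350 epochs, and batch size of 256 come from the post, while the learning rate, momentum value, and placeholder model are my assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 10)  # placeholder module standing in for an Adanet
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
num_epochs, batch_size = 350, 256
```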

To investigate how the network allocates computation, the authors plotted the execution rates for different classes of images. From the figure above we can see that the down-sampling layers are crucial, that the wide Adanet shows more variation between classes (indicating that more capacity might help individual layers specialize on certain subsets of the data), and that most inter-class variation comes from the later layers of the network.

Next the authors tested Adanet on the ImageNet data set, and as seen above, Adanet is able to reduce computation cost. One thing to note is that Adanet was able to outperform ResNet 34, which has fewer parameters but a larger computational cost.

Plotting the execution rates, the authors observed a much wider range of execution rates (layers 9 and 10 are rarely used for certain classes); the down-sampling layers and the last layer are crucial, and the execution rates differ significantly between man-made objects and animals in the later layers.

When we plot the execution rates of different layers over the first 30 epochs, we can directly observe that the layers quickly separate into key layers and less critical layers.

When we plot a histogram of how many layers are used for different classes, we get something like the figure above. On average 10.81 layers are executed, with a standard deviation of 1.11. However, as seen above, images of birds use one layer fewer than images of consumer goods. (Super interesting.)

Robustness to Adversarial Attacks

To understand how the specialized layers perform under adversarial attacks, the authors used the Fast Gradient Sign Method to create adversarial examples and fed them to the network. (The authors also applied a JPEG compression defense to the created adversarial examples.) As seen above, Adanets are more robust than regular ResNets.
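For context, here is a minimal sketch of the Fast Gradient Sign Method in its standard form; this is not the authors' specific attack setup, and the function name is my own.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon):
    """Craft adversarial examples by taking a single step of size epsilon in the
    direction of the sign of the loss gradient with respect to the input."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return (images + epsilon * images.grad.sign()).detach()
```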

Next, to see whether adversarial examples affect the execution rates of different layers, the authors plotted the execution-rate bar graph for a regular bird image and an adversarial one, and we can see that the execution rates are not affected much.

Conclusion

In conclusion, the authors of this paper have introduced a novel network architecture called Adanets, which has the ability to learn which layers to execute depending on the input data. Via multiple experiments the authors found that this type of architecture not only outperforms regular ResNets but is also more robust to adversarial attacks.

Final Words

This is a very clever and novel architecture design; I was especially amazed to learn that the network is able to differentiate between man-made objects and animals.

If any errors are found, please email me at jae.duk.seo@gmail.com; if you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view the blog post.

Reference

  1. Veit, A., & Belongie, S. (2017). Convolutional Networks with Adaptive Computation Graphs [PDF]. arXiv.org. Retrieved 19 July 2018, from https://arxiv.org/pdf/1711.11503.pdf
  2. Veit, A., & Belongie, S. (2017). Convolutional Networks with Adaptive Computation Graphs. arXiv.org. Retrieved 19 July 2018, from https://arxiv.org/abs/1711.11503
