History of Convolutional Blocks in simple Code

Paul-Louis Pröve
Towards Data Science
10 min read · Jul 25, 2018


I try my best to read ML- and AI-related papers on a regular basis. It’s the only way to stay up to date with recent advancements. As a computer scientist, I often hit a wall when going through the dense scientific prose or the mathematical notation of the formulas. I find it much easier to understand things in plain code. So in this article, I want to guide you through a curated list of important convolutional blocks from recent architectures, implemented in Keras.

When you look for implementations of popular architectures on GitHub, you will be surprised by how much code they contain. It’s good practice to include a sufficient amount of comments and to make the model configurable with additional parameters, but at the same time this can distract from the essence of the architecture. To simplify and shorten the snippets even more, I’m going to use a few alias functions:

from keras.layers import (Conv2D, SeparableConv2D, Dense,
    MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D,
    Reshape, add, concatenate, multiply)
def conv(x, f, k=3, s=1, p='same', d=1, a='relu'):
    return Conv2D(filters=f, kernel_size=k, strides=s,
                  padding=p, dilation_rate=d, activation=a)(x)
def dense(x, f, a='relu'):
    return Dense(f, activation=a)(x)
def maxpool(x, k=2, s=2, p='same'):
    return MaxPooling2D(pool_size=k, strides=s, padding=p)(x)
def avgpool(x, k=2, s=2, p='same'):
    return AveragePooling2D(pool_size=k, strides=s, padding=p)(x)
def gavgpool(x):
    return GlobalAveragePooling2D()(x)
def sepconv(x, f, k=3, s=1, p='same', d=1, a='relu'):
    return SeparableConv2D(filters=f, kernel_size=k, strides=s,
                           padding=p, dilation_rate=d, activation=a)(x)

I find the code a lot more readable once the boilerplate is gone. Of course, this only works if you understand my one-letter abbreviations. Let’s get started.

Bottleneck Block

The number of parameters of a convolutional layer depends on the kernel size, the number of input filters and the number of output filters. The wider your network gets, the more expensive a 3x3 convolution becomes.

def bottleneck(x, f=32, r=4):
    x = conv(x, f//r, k=1)
    x = conv(x, f//r, k=3)
    return conv(x, f, k=1)

The idea behind a bottleneck block is to reduce the number of channels by a certain rate r using a cheap 1x1 convolution, so that the following 3x3 convolution has fewer parameters. In the end we widen the network again with another 1x1 convolution.
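
To get a feeling for the savings, here is a rough back-of-the-envelope comparison (my own numbers, weights only, biases ignored) for a hypothetical layer with 256 input and 256 output channels:

# plain 3x3 convolution, 256 -> 256 channels
plain = 3 * 3 * 256 * 256                      # 589,824 parameters

# bottleneck with r=4: 256 -> 64 -> 64 -> 256
squeeze = 1 * 1 * 256 * 64                     # 16,384
spatial = 3 * 3 * 64 * 64                      # 36,864
expand  = 1 * 1 * 64 * 256                     # 16,384
total   = squeeze + spatial + expand           # 69,632 parameters, roughly 8x fewer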

Inception Module

The Inception Module introduced the idea of using different operations in parallel and merging the results. That way the network can learn different types of filters.

def naive_inception_module(x, f=32):
    a = conv(x, f, k=1)
    b = conv(x, f, k=3)
    c = conv(x, f, k=5)
    d = maxpool(x, k=3, s=1)
    return concatenate([a, b, c, d])

Here we merge the outputs of convolutional layers with kernel sizes of 1, 3 and 5 and of a MaxPooling layer. This snippet shows a naive implementation of an Inception module. The actual implementation combines it with the idea of bottlenecks from above, which makes it slightly more complex.

def inception_module(x, f=32, r=4):
    a = conv(x, f, k=1)
    b = conv(x, f//r, k=1)
    b = conv(b, f, k=3)
    c = conv(x, f//r, k=1)
    c = conv(c, f, k=5)
    d = maxpool(x, k=3, s=1)
    d = conv(d, f, k=1)
    return concatenate([a, b, c, d])

Residual Block

ResNet is an architecture introduced by researchers from Microsoft that allowed neural networks to have as many layers as they liked, while still improving the accuracy of the model. By now you may be used to this, but before ResNet it just wasn’t the case.

def residual_block(x, f=32, r=4):
    m = conv(x, f//r, k=1)
    m = conv(m, f//r, k=3)
    m = conv(m, f, k=1)
    return add([x, m])

The idea is to add the initial activations to the output of a convolutional block. That way the network can decide during training how much of the new convolutional output it actually uses. Note that an Inception module concatenates its outputs whereas a residual block adds them.
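
One caveat: add([x, m]) only works if the input x already has f channels. When the channel counts differ, the ResNet paper uses a linear 1x1 projection on the shortcut. A minimal sketch of that variant, with residual_block_proj being my own name:

def residual_block_proj(x, f=32, r=4):
    m = conv(x, f//r, k=1)
    m = conv(m, f//r, k=3)
    m = conv(m, f, k=1)
    # project the shortcut with a 1x1 convolution (no non-linearity)
    # whenever the input doesn't already have f channels
    if int(x.shape[-1]) != f:
        x = conv(x, f, k=1, a='linear')
    return add([x, m])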

ResNeXt Block

Based on its name you can guess that ResNeXt is closely related to ResNet. The authors introduced the term cardinality to convolutional blocks as another dimension like width (number of channels) and depth (number of layers).

The cardinality refers to the number of parallel paths that appear in a block. This sounds similar to the Inception block, which features four operations happening in parallel. However, instead of using different types of operations in parallel, a cardinality of 4 will simply use the same operation four times.

Why would you put them in parallel if they do the same thing? Good question. This concept is also referred to as a grouped convolution and goes back to the original AlexNet paper. However, at the time it was primarily used to split the training process across multiple GPUs, whereas ResNeXt uses it to increase parameter efficiency.

def resnext_block(x, f=32, r=2, c=4):
    l = []
    for i in range(c):
        m = conv(x, f//(c*r), k=1)
        m = conv(m, f//(c*r), k=3)
        m = conv(m, f, k=1)
        l.append(m)
    m = add(l)
    return add([x, m])

The idea is to take all of the input channels and divide them into groups. Convolutions will only act within their dedicated group of channels and not across all of them. It was found that each group will learn different types of features while increasing the efficiency of weights.

Imagine a bottleneck block that first reduces 256 input channels to 64 using a compression rate of 4 and then brings them back up to 256 channels as the output. If we wanted to introduce a cardinality of 32 and a compression of 2, we’d use 32 1x1 convolutional layers in parallel with 4 (256 / (32*2)) output channels each. Afterwards we’d use 32 3x3 convolutional layers with 4 output channels, followed by 32 1x1 layers with 256 output channels each. The last step adds these 32 parallel paths, giving us a single output, before also adding the initial input to create a residual connection.

Left: ResNet block. Right: ResNeXt block of roughly the same parameter complexity

This is a lot to digest. Use the image above to get a visual representation of what’s going on and maybe copy the snippets to build a small network in Keras yourself. Isn’t it awesome that my complicated description can be summarized in these 9 simple lines of code?
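
For instance, plugging the numbers from the description above into the snippet, assuming x is an activation map that already has 256 channels, would look roughly like this:

# 256 channels, cardinality 32, compression 2: each of the 32 paths
# works on 256 // (32 * 2) = 4 channels before expanding back to 256
y = resnext_block(x, f=256, r=2, c=32)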

By the way, if the cardinality equals the number of channels, every group works on a single channel and we get what is called a depthwise convolution; together with a subsequent 1x1 convolution it forms a depthwise separable convolution. These gained a lot of popularity with the introduction of the Xception architecture.
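
To make that connection explicit, here is a sketch of a separable convolution written out in two steps using Keras’ DepthwiseConv2D; sepconv_explicit is my own illustrative helper, not something from the papers:

from keras.layers import DepthwiseConv2D

def sepconv_explicit(x, f, k=3):
    # one k x k filter per input channel, i.e. cardinality == number of channels
    x = DepthwiseConv2D(kernel_size=k, padding='same')(x)
    # a 1x1 convolution then mixes the channels; together with the depthwise
    # step this is what SeparableConv2D bundles into a single layer
    return conv(x, f, k=1)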

Dense Block

A dense block is an extreme version of the residual block where every convolutional layer gets the outputs of all prior convolutional layers in the block. First, we add the input activations to a list, after which we enter a loop that iterates over the depth of the block. Each convolutional output is also concatenated to the list, so that following iterations receive more and more input feature maps. This scheme continues until the desired depth is reached.

def dense_block(x, f=32, d=5):
    l = x
    for i in range(d):
        x = conv(l, f)
        l = concatenate([l, x])
    return l

While it takes months of research to get to an architecture that works as well as DenseNet, the actual building block can be as simple as this. Fascinating.

Squeeze-and-Excitation Block

SENet was the state of the art on ImageNet for a short period. It was built on top of ResNeXt and focused on modeling the channel-wise information of a network. In a regular convolutional layer, the dot product sums over all input channels, and every channel enters this sum with equal weight.

Squeeze-and-Excitation Module

SENet introduced a very simple module that can be added to any existing architecture: a tiny neural network that learns how each of the filters should be weighted depending on the input. It’s not a convolutional block per se, but because it can be added to any convolutional block and potentially enhance its performance, I wanted to add it to the mix.

def se_block(x, f, rate=16):
    m = gavgpool(x)
    m = dense(m, f // rate)
    m = dense(m, f, a='sigmoid')
    # reshape to (1, 1, f) so the channel weights broadcast over the spatial dims
    m = Reshape((1, 1, f))(m)
    return multiply([x, m])

Each channel is compressed into a single value and fed into a two-layer neural network. Depending on the channel distribution, this network learns to weight the channels based on their importance. In the end, these weights are multiplied with the convolutional activations.

SENets introduce a tiny computational overhead while potentially improving any convolutional model. In my opinion, this block hasn’t gotten the attention it deserves.
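
Attaching it to one of the blocks from above could look like this; the wiring is my own example, just to show where the module slots in:

# recalibrate the channels of a residual block's output,
# assuming x already has 64 channels so the residual addition works
y = residual_block(x, f=64)
y = se_block(y, f=64)   # f must match the number of channels in y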

NASNet Normal Cell

This is where things get kind of ugly. We’re leaving the space of people coming up with simple yet effective design decisions and entering a world of algorithms designing neural network architectures. NASNet is incredible in the sense of how it was designed, but the actual architecture is comparatively complex. All we know is that it works pretty damn well on ImageNet.

By hand, the authors defined a search space of different types of convolutional and pooling layers, each with different possible settings. They also defined how these layers could be arranged in parallel or sequentially, and how they could be added or concatenated. Once this was defined, they set up a Reinforcement Learning (RL) algorithm based on a recurrent neural network that was rewarded if a specific design proposal did well on the CIFAR-10 dataset.

The resulting architecture not only performed well on CIFAR-10, it also achieved state-of-the-art results on ImageNet. NASNet consists of a Normal Cell and a Reduction Cell that are repeated one after another.

def normal_cell(x1, x2, f=32):
    a1 = sepconv(x1, f, k=3)
    a2 = sepconv(x1, f, k=5)
    a = add([a1, a2])
    b1 = avgpool(x1, k=3, s=1)
    b2 = avgpool(x1, k=3, s=1)
    b = add([b1, b2])
    c2 = avgpool(x2, k=3, s=1)
    c = add([x1, c2])
    d1 = sepconv(x2, f, k=5)
    d2 = sepconv(x1, f, k=3)
    d = add([d1, d2])
    e2 = sepconv(x2, f, k=3)
    e = add([x2, e2])
    return concatenate([a, b, c, d, e])

This is how you could implement a Normal Cell in Keras. There is nothing new going on except that this particular combination of layers and settings works really well.
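
One thing the snippet doesn’t show is where x1 and x2 come from: in NASNet, each cell takes the outputs of the two preceding cells as its inputs. Note also that the additions inside the cell only line up if both inputs already carry f channels (the full architecture takes care of this with small adjustment convolutions), and that the final concatenation returns 5*f channels. A rough usage sketch under those assumptions, with variable names of my own choosing:

# prev_cell and prev_prev_cell are the outputs of the two preceding cells,
# both assumed to carry f=64 channels already
y = normal_cell(x1=prev_cell, x2=prev_prev_cell, f=64)   # y has 5 * 64 = 320 channels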

Inverted Residual Block

By now you’ve heard of the bottleneck block and separable convolutions. Let’s put them together. If you run some tests you will notice that because separable convolutions already reduce the number of parameters, compressing them will probably hurt performance instead of increasing it.

The authors came up with the idea of doing the opposite of what a bottleneck residual block does. They increase the number of channels using a cheap 1x1 convolution, because the following separable convolution is already very parameter-efficient. The separable convolution then brings the number of channels back down before the result is added to the initial activations.

def inv_residual_block(x, f=32, r=4):
    m = conv(x, f*r, k=1)
    m = sepconv(m, f, a='linear')
    return add([m, x])

The last piece of the puzzle is that the separable convolution is not followed by an activation function. Instead, its output is directly added to the input. This block has proven to be very efficient when put into an architecture.
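
As with the residual block earlier, the addition only works if x already has f channels. In MobileNetV2, where this block comes from, the shortcut is simply dropped whenever the spatial size or the channel count changes, for example in strided blocks. A hedged sketch of that variant, with inv_residual_block_safe being my own name:

def inv_residual_block_safe(x, f=32, r=4, s=1):
    m = conv(x, f*r, k=1)
    m = sepconv(m, f, s=s, a='linear')
    # only keep the shortcut when spatial size and channel count are unchanged
    if s == 1 and int(x.shape[-1]) == f:
        return add([m, x])
    return m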

AmoebaNet Normal Cell

Normal Cell of AmoebaNet

With AmoebaNet we arrive at the current state of the art on ImageNet, and probably image recognition in general. Like NASNet, it was designed by an algorithm using the same search space as before. The only twist is that they switched out the Reinforcement Learning algorithm for a genetic algorithm, often referred to as Evolution. Going into the details of how this works is beyond the scope of this article, though. The end of the story is that through Evolution the authors were able to find an even better solution at lower computational cost compared to NASNet. It scores 97.87% top-5 accuracy on ImageNet, a first for a single architecture.

Looking at the cell, this block doesn’t add anything to the mix that you haven’t already seen. Why don’t you try to implement the new Normal Cell based on the picture to check whether you were able to follow along?

Conclusion

I hope this article gave you a solid understanding of important convolutional blocks and showed you that implementing them may be easier than you think. For a more detailed view of these architectures, have a look at their respective papers. You’ll notice that once you’ve grasped the core idea of a paper, it’s much easier to understand the rest. Please also note that the actual implementations often add BatchNormalization to the mix and vary in regard to where the activation function is applied. Feel free to follow up with questions in the comments.

PS: I’m thinking about creating a repository that includes all these blocks and architectures in simple code. Would that help some of you?
