Squeeze-and-Excitation Networks

Setting a new state of the art on ImageNet

Paul-Louis Pröve
Towards Data Science


Squeeze-and-Excitation Networks (SENets) introduce a building block for CNNs that improves channel interdependencies at almost no computational cost. They were used in this year's ImageNet competition and helped to reduce the top-5 error by roughly 25% relative to last year's winning entry. Besides this huge performance boost, they can easily be added to existing architectures. The main idea is this:

Let’s add parameters to each channel of a convolutional block so that the network can adaptively adjust the weighting of each feature map.

As simple as it may sound, this is it. So, let’s take a closer look at why this works so well and how we can potentially improve any model with five simple lines of code.

The “Why”

CNNs use their convolutional filters to extract hierarchical information from images. Lower layers find trivial pieces of context such as edges or high frequencies, while upper layers detect faces, text or other complex geometrical shapes. They extract whatever is necessary to solve a task efficiently.

All of this works by fusing the spatial and channel information of an image. The different filters first find spatial features in each input channel before adding this information across all available output channels. I covered this operation in a little more detail in another article.

All you need to understand for now is that the network weights each of its channels equally when creating the output feature maps. SENets are all about changing this by adding a content-aware mechanism that weights each channel adaptively. In its most basic form, this could mean adding a single parameter per channel and letting the network learn a linear scalar for how relevant each channel is.
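To make that basic form concrete, here is a minimal sketch (my own illustration, not the paper's block) of a Keras layer that learns one scalar per channel and rescales each feature map with it. The name ChannelScale and the 'ones' initialization are assumptions of mine:

from tensorflow.keras.layers import Layer

class ChannelScale(Layer):
    # hypothetical "basic form": one learnable scalar per channel
    def build(self, input_shape):
        ch = input_shape[-1]
        self.scale = self.add_weight(name='scale', shape=(ch,),
                                     initializer='ones', trainable=True)

    def call(self, inputs):
        # broadcast the per-channel scalars over the spatial dimensions
        return inputs * self.scale

Dropping ChannelScale()(x) after a convolution would rescale its feature maps, but unlike an SE block the learned weights are fixed after training and do not depend on the content of the input.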

However, the authors push it a little further. First, they get a global understanding of each channel by squeezing each feature map to a single numeric value. This results in a vector of size n, where n is equal to the number of convolutional channels. Afterwards, it is fed through a two-layer neural network, which outputs a vector of the same size. These n values can now be used as weights on the original feature maps, scaling each channel based on its importance.
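Written out (roughly in the paper's notation), with u_c the c-th feature map of size H × W, δ the ReLU, σ the sigmoid and r the reduction ratio:

z_c = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j), \qquad
s = \sigma\left(W_2 \, \delta(W_1 z)\right), \qquad
\tilde{u}_c = s_c \cdot u_c

where W_1 and W_2 are the weights of the two fully connected layers.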

The “How”

In the last paragraph you may have lost a little confidence that this is really as simple as I promised it to be. So, let’s jump straight into implementing an SE-block.

from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Multiply

def se_block(in_block, ch, ratio=16):
    x = GlobalAveragePooling2D()(in_block)        # squeeze each feature map to one value
    x = Dense(ch // ratio, activation='relu')(x)  # bottleneck FC layer with ReLU
    x = Dense(ch, activation='sigmoid')(x)        # per-channel gates in [0, 1]
    return Multiply()([in_block, x])              # reweight the feature maps (gates broadcast over H and W)
  1. The function is given an input convolutional block and the current number of channels it has.
  2. We squeeze each channel to a single numeric value using average pooling.
  3. A fully connected layer followed by a ReLU function adds the necessary nonlinearity. Its output dimensionality is also reduced by the ratio parameter.
  4. A second fully connected layer followed by a sigmoid activation gives each channel a smooth gating value between 0 and 1.
  5. Finally, we weight each feature map of the convolutional block based on the result of our side network.

These five steps add almost no additional computing cost (less than 1%) and can be added to any model.
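As a quick usage illustration (a toy model of my own, not from the paper; the input size and layer widths are arbitrary), the block slots in right after any convolution:

from tensorflow.keras.layers import Input, Conv2D, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(224, 224, 3))
x = Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = se_block(x, ch=64)                       # recalibrate the 64 feature maps
x = GlobalAveragePooling2D()(x)
outputs = Dense(1000, activation='softmax')(x)
model = Model(inputs, outputs)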

Vanilla ResNet Module vs the proposed SE-ResNet Module

The authors show that by adding SE-blocks to ResNet-50 you can expect almost the same accuracy as ResNet-101 delivers. This is impressive for a model requiring only about half the computational cost. The paper further investigates other architectures such as Inception, Inception-ResNet and ResNeXt. The latter leads to a modified version that achieves a top-5 error of 3.79% on ImageNet.
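To make the SE-ResNet module from the figure above concrete, here is a hedged sketch of how the SE block can be attached to a residual branch before the skip connection is added back. This is a simplified two-convolution block, not the paper's exact bottleneck design, and the filter counts are placeholders:

from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, Add

def se_resnet_module(in_block, ch, ratio=16):
    # simplified residual branch: two 3x3 convolutions with batch normalization
    x = Conv2D(ch, 3, padding='same')(in_block)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(ch, 3, padding='same')(x)
    x = BatchNormalization()(x)
    # recalibrate the residual branch, then add the identity shortcut
    # (assumes in_block already has ch channels)
    x = se_block(x, ch, ratio)
    return Activation('relu')(Add()([in_block, x]))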

How SENets improve existing architectures

What amazes me the most about SENets is just how simple and yet effective they are. Being able to add this approach to any model at almost no cost should make you jump back to the drawing board and retrain everything you ever built.

This was the first article in a series of paper summaries I’m planning to write. I want to force myself into reading and understanding new papers to keep up with recent AI trends. If you want to contribute or find mistakes in my articles, please reach out to me.
