Deep Learning: Understand Theory for Better Practice

Forward and backward propagations for 2D Convolutional layers

Generalization to multi-channel inputs and multiple filters

Tristan Dardoize
Towards Data Science
6 min read · Dec 28, 2019


Motivation

Numerous articles on Towards Data Science already discuss backpropagation for convolutional neural networks.

They explain the simple cases well (e.g. an input with a single channel, or a single convolutional filter at a time), but I found it somewhat difficult to generalize the backward-pass implementation to arbitrary layer configurations.

To overcome this difficulty, I decided to go back to the theory behind backpropagation. Once I had derived the general equations, the implementation became a lot easier.

This article aims at giving you the key equations of forward propagation and backpropagation for convolutional layers with multi-channel inputs and multiple filters, and at showing how to derive them.

If you are only interested in the results, feel free to jump to the conclusion!

A few definitions

A convolutional layer performs… convolutions! We therefore need to define the relevant mathematical operators:

On the one hand, the convolution between an image I of arbitrary size with C channels and a kernel K of size (k1, k2) is defined by:

Convolution of image I with kernel K
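Written out with zero-based indices (one common convention), the convolution sums the kernel against the image with reversed offsets:

conv(I, K)_{i,j} = Σ_{c=0}^{C-1} Σ_{m=0}^{k1-1} Σ_{n=0}^{k2-1} K_{m,n,c} · I_{i-m, j-n, c}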

On the other hand, the cross-correlation between an image I of arbitrary size with C channels and a kernel K of size (k1, k2) is defined by:

Cross-correlation of image I with kernel K
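With the same indexing convention, the cross-correlation uses direct (non-reversed) offsets:

corr(I, K)_{i,j} = Σ_{c=0}^{C-1} Σ_{m=0}^{k1-1} Σ_{n=0}^{k2-1} K_{m,n,c} · I_{i+m, j+n, c}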

A sharp eye notices that the convolution of an image with a kernel is equivalent to the cross-correlation of the image with the same kernel flipped by 180°:

Kernel flip
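With zero-based indices, flipping the kernel by 180° simply reverses both spatial axes:

flip(K)_{m,n,c} = K_{k1-1-m, k2-1-n, c}

In NumPy, this is simply K[::-1, ::-1, :].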

Therefore, we can directly treat the flipped kernel as the kernel we want to learn for the convolution, and let our layer compute only a cross-correlation.

In this post, we will consider elements X as 3- or 4-dimensional arrays:
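Concretely, following the indexing used throughout the article, x_{i,j,c} denotes the element of X at height i, width j and channel c, and x_{i,j,c}^f adds a filter index f when X holds one such array per filter (as the weights do).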

In the case where X has 3 dimensions, f does not appear in the notation. That’s it! We can now discuss the convolutional layer itself.

Convolutional layer

The layer transforms the output of the previous layer, A_prev, of height n_H_prev, width n_W_prev and C channels into the variable Z of height n_H, width n_W and F channels.

Convolutional layer: input and output shapes

The parameters of this layer are:

  • F kernels (or filters) defined by their weights w_{i,j,c}^f and biases b^f
  • The kernel size (k1, k2) introduced above
  • An activation function
  • Strides (s1, s2), which define the step with which the kernel is slid over the input image
  • Paddings (p1, p2), which define the number of zeros added on the borders of A_prev

Forward propagation

The convolutional layer forward-propagates the padded input; we therefore work with A_prev_pad, the zero-padded version of A_prev, in the convolution.

The equations of forward propagation are then:

Forward propagation equations
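With zero-based indices, strides (s1, s2) and the padded input A_prev_pad, the pre-activation and the activation can be written as:

z_{i,j,f} = Σ_{c=0}^{C-1} Σ_{m=0}^{k1-1} Σ_{n=0}^{k2-1} w_{m,n,c}^f · a_prev_pad_{i·s1+m, j·s2+n, c} + b^f

a_{i,j,f} = g(z_{i,j,f})

To make this concrete, here is a minimal NumPy sketch of the forward pass (naive loops rather than a vectorized implementation; the function name, argument layout and array shapes are my own conventions):

```python
import numpy as np

def conv_forward(A_prev, W, b, g, stride=(1, 1), pad=(0, 0)):
    """Naive forward pass of a 2D convolutional layer.

    A_prev : (n_H_prev, n_W_prev, C) output of the previous layer
    W      : (k1, k2, C, F)          weights of the F kernels
    b      : (F,)                    one bias per kernel
    g      : element-wise activation function
    """
    k1, k2, C, F = W.shape
    (s1, s2), (p1, p2) = stride, pad

    # Zero-pad the two spatial dimensions of the input.
    A_prev_pad = np.pad(A_prev, ((p1, p1), (p2, p2), (0, 0)))

    n_H = (A_prev.shape[0] + 2 * p1 - k1) // s1 + 1
    n_W = (A_prev.shape[1] + 2 * p2 - k2) // s2 + 1
    Z = np.zeros((n_H, n_W, F))

    for i in range(n_H):
        for j in range(n_W):
            # Window of A_prev_pad seen by output position (i, j).
            window = A_prev_pad[i * s1:i * s1 + k1, j * s2:j * s2 + k2, :]
            for f in range(F):
                # Cross-correlation of the window with the f-th kernel, plus its bias.
                Z[i, j, f] = np.sum(window * W[:, :, :, f]) + b[f]

    A = g(Z)
    return Z, A, A_prev_pad  # A_prev_pad is kept for the backward pass
```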

Backward propagation

Backward propagation has three goals:

  • Propagate the error from a layer to the previous one
  • Compute the derivative of the error with respect to the weights
  • Compute the derivative of the error with respect to the biases

Notation

For ease of notation, we define:
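For any variable x of the network, dx stands for the derivative of the error E with respect to x; for instance:

da_{i,j,f} = ∂E/∂a_{i,j,f},  dz_{i,j,f} = ∂E/∂z_{i,j,f},  dw_{i,j,c}^f = ∂E/∂w_{i,j,c}^f,  db^f = ∂E/∂b^f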

The maths!

In practice, when we perform the backward pass of a layer, we always know either da_{i,j,f} or dz_{i,j,f}; here we assume that da_{i,j,f} is known.

The expression of dz_{i,j,f} is then given by Eq. [2]:
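In this notation, dz is simply obtained by differentiating the activation step:

dz_{i,j,f} = da_{i,j,f} · g'(z_{i,j,f})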

Where g’ is the derivative of g.

Using the chain rule, we can compute dw_{i,j,c}^f:
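Summing over every output value that could depend on this weight (all positions and, a priori, all channels), the chain rule gives:

dw_{i,j,c}^f = Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} Σ_{k=0}^{F-1} dz_{m,n,k} · ∂z_{m,n,k}/∂w_{i,j,c}^f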

Recalling that dz_{m,n,k} is only linked to the kth filter (as given by Eq. [1]), the weights of the fth kernel are only linked to the fth channel of dZ, so the sum over k reduces to the single term k = f:
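dw_{i,j,c}^f = Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} dz_{m,n,f} · ∂z_{m,n,f}/∂w_{i,j,c}^f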

We can then obtain ∂z_{m,n,f}/∂w_{i,j,c}^f using Eq. [1]:
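Differentiating the expression of z_{m,n,f} with respect to w_{i,j,c}^f leaves a single input value (writing the strides explicitly):

∂z_{m,n,f}/∂w_{i,j,c}^f = a_prev_pad_{m·s1+i, n·s2+j, c}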

The expression of Eq. [5] is then:
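dw_{i,j,c}^f = Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} dz_{m,n,f} · a_prev_pad_{m·s1+i, n·s2+j, c}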

One can notice that this is the cross-correlation of A_prev_pad with dZ as the kernel.

The same procedure is followed for the bias:
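db^f = Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} dz_{m,n,f} · ∂z_{m,n,f}/∂b^f

where ∂z_{m,n,f}/∂b^f = 1 at every output position.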

And therefore:
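db^f = Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} dz_{m,n,f}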

The last thing to perform is the backpropagation of the error: finding the relation between dA_prev and dZ.

Remembering that Eq. [1] relates Z to the padded version of A_prev, we will first compute da_prev_pad.

Using the chain rule (again!), we have:
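da_prev_pad_{i,j,c} = Σ_{f=0}^{F-1} Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} (∂E/∂z_{m,n,f}) · (∂z_{m,n,f}/∂a_prev_pad_{i,j,c})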

We recognize dz_{m,n,f} as the first term of the sum, which is good. Let’s focus on the second term:
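∂z_{m,n,f}/∂a_prev_pad_{i,j,c} = Σ_{c'=0}^{C-1} Σ_{m'=0}^{k1-1} Σ_{n'=0}^{k2-1} w_{m',n',c'}^f · ∂a_prev_pad_{m+m', n+n', c'}/∂a_prev_pad_{i,j,c}

(obtained by expanding z_{m,n,f} with Eq. [1], written here with a stride of 1 to keep the indices readable).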

The innermost derivative is non-zero if and only if m' + m = i, n' + n = j and c' = c.

Therefore:
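∂z_{m,n,f}/∂a_prev_pad_{i,j,c} = w_{i-m, j-n, c}^f

(with the convention that this weight is 0 whenever i - m or j - n falls outside the kernel).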

And so,
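da_prev_pad_{i,j,c} = Σ_{f=0}^{F-1} Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} dz_{m,n,f} · w_{i-m, j-n, c}^f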

We notice that Eq. [9] describes a convolution where the layer’s filters are considered to be the image, and where dZ is the kernel.

We finally obtain da_prev_{i,j,c} by selecting da_prev_pad_{i+p1, j+p2, c}, with p1 and p2 being the padding values along the first and second dimensions of this layer.
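To make these results concrete, here is a minimal NumPy sketch of the backward pass under the same conventions as the forward sketch above (naive loops, general strides; the function name and signature are my own, and dZ is assumed to have already been obtained from dA via the element-wise relation above):

```python
import numpy as np

def conv_backward(dZ, A_prev_pad, W, stride=(1, 1), pad=(0, 0)):
    """Naive backward pass of a 2D convolutional layer.

    dZ         : (n_H, n_W, F)  gradient of the error with respect to Z
    A_prev_pad : padded input saved during the forward pass
    W          : (k1, k2, C, F) weights of the F kernels
    Returns dA_prev, dW, db.
    """
    k1, k2, C, F = W.shape
    n_H, n_W, _ = dZ.shape
    (s1, s2), (p1, p2) = stride, pad

    dA_prev_pad = np.zeros_like(A_prev_pad)
    dW = np.zeros_like(W)
    db = np.zeros(F)

    for i in range(n_H):
        for j in range(n_W):
            # Input window that produced z_{i,j,f} during the forward pass.
            h0, w0 = i * s1, j * s2
            window = A_prev_pad[h0:h0 + k1, w0:w0 + k2, :]
            for f in range(F):
                # dW: accumulate the cross-correlation of A_prev_pad with dZ.
                dW[:, :, :, f] += window * dZ[i, j, f]
                # dA_prev_pad: scatter the f-th kernel back, weighted by dZ.
                dA_prev_pad[h0:h0 + k1, w0:w0 + k2, :] += W[:, :, :, f] * dZ[i, j, f]
                # db: sum dZ over all output positions of channel f.
                db[f] += dZ[i, j, f]

    # Drop the padding to recover dA_prev from dA_prev_pad.
    H_pad, W_pad = dA_prev_pad.shape[:2]
    dA_prev = dA_prev_pad[p1:H_pad - p1, p2:W_pad - p2, :]
    return dA_prev, dW, db
```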

Conclusion

In this article, you have seen how to compute the forward and backward propagations for a convolutional layer with an arbitrary number of filters and of input channels.

The forward pass is given by two equations:

Forward pass equations
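In the notation used throughout the article (zero-based indices, strides (s1, s2), padded input A_prev_pad):

z_{i,j,f} = Σ_{c=0}^{C-1} Σ_{m=0}^{k1-1} Σ_{n=0}^{k2-1} w_{m,n,c}^f · a_prev_pad_{i·s1+m, j·s2+n, c} + b^f

a_{i,j,f} = g(z_{i,j,f})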

To compute the backpropagation of a convolutional layer, you only need to implement these four equations:

Backward pass equations
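With the same conventions (the last equation written for a stride of 1, matching the derivation above):

dz_{i,j,f} = da_{i,j,f} · g'(z_{i,j,f})

dw_{i,j,c}^f = Σ_{m,n} dz_{m,n,f} · a_prev_pad_{m·s1+i, n·s2+j, c}

db^f = Σ_{m,n} dz_{m,n,f}

da_prev_pad_{i,j,c} = Σ_{f} Σ_{m,n} dz_{m,n,f} · w_{i-m, j-n, c}^f

followed by the un-padding step da_prev_{i,j,c} = da_prev_pad_{i+p1, j+p2, c}.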

Other details could have been covered in this article (e.g. how to compute the values of n_H and n_W), but numerous articles have already discussed these points, so I encourage you to read them before you start implementing your own convolutional layers.

I wish you good luck in implementing these algorithms!

Cheers,

Tristan
