It took me so much time to understand how CNNs work. And trust me, good content on this is incredibly scarce, like really scarce. Everywhere they will tell you how forward propagation works in a CNN, but never even start on backward propagation. And without the full picture, one's understanding always remains half-baked.
Pre-requisites
- Should be familiar with the basics of CNNs – convolution layers, max pooling, fully connected layers. Do a basic Google search and understand these concepts. It should take an hour or so to get started.
- Differential calculus – should know how the chain rule works and the basic rules of differentiation.
- Should know how the backpropagation maths actually works in an ANN. I would highly recommend reading my previous article on this, in case you aren't familiar.
Nature of this article
So, my main problem with the rest of the articles was that nowhere do they mention the overall flow. Each layer and concept was beautifully explained, but how backpropagation works across the layers – that info was missing. It was therefore very difficult for me to visualize how the errors flow backward through the whole network. So, this article will take a few CNN scenarios and try to make you understand the overall flow.
The intent is not to cover depth but breadth and the overall flow. For depth, I'll point you to the relevant articles wherever needed to help you build deeper intuition. Consider this article an index for CNN maths. Just to set expectations clearly: this won't be a 5-minute read, and it will ask you to read relevant articles as and when required.
Scenario 1: 1 Convolution Layer + 1 Fully Connected Layer

Forward Pass
X is the input image, say a 3*3 matrix, and Filter is a 2*2 matrix. Both will be convolved to give the output XX (a 2*2 matrix).
Now, XX will be flattened and fed to a fully connected network with w (a 1*4 matrix) as weights, which will give an Output.
Finally, we will calculate the error E at the end as the mean squared error between Y (expected) and Output (actual).

I would highly advise you to calculate XX on your own – it will give you an intuition for the convolution layer. You can check your numbers against the sketch below.
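If you'd like to verify your hand calculation, here is a minimal NumPy sketch of this forward pass. The concrete numbers in X, Filter, and w are made up purely for illustration, and, as in most CNN implementations, the "convolution" is really a cross-correlation (no filter flip):

```python
import numpy as np

# Illustrative values only -- any 3*3 input, 2*2 filter, and 1*4 weights work.
X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])          # input image X (3*3)
F = np.array([[1., 0.],
              [0., -1.]])             # Filter (2*2)
w = np.array([0.1, 0.2, 0.3, 0.4])    # fully connected weights w (1*4)
Y = 1.0                               # expected output

# Convolution layer: slide the 2*2 filter over the 3*3 input -> 2*2 output XX.
XX = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        XX[i, j] = np.sum(X[i:i+2, j:j+2] * F)

Output = w @ XX.flatten()             # fully connected layer -> a single number
E = (Y - Output) ** 2                 # squared error
print(XX, Output, E)
```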
Backward Pass
The goal of the backward pass is to choose values of Filter and w that decrease E. So basically, our goal is to work out how we should change w and Filter so that E decreases.

Let’s start with the first term.

Guess how?!
Line 1: using the chain rule. Line 2: using differential calculus. Spend a minute on this; it should be easily understandable. If not, check my previous article (mentioned in the pre-requisites), or check this one.
Before moving forward, do make sure you do these calculations on your own. Please comment in case this is not easily understandable.
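For reference, here is how that first term works out, written in LaTeX and assuming the squared error E = (Y − Output)² from the forward pass above (if your E carries a ½, the factor of 2 disappears):

```latex
\frac{\partial E}{\partial w}
  = \frac{\partial E}{\partial \mathrm{Output}}
    \cdot \frac{\partial \mathrm{Output}}{\partial w}
  = 2\,(\mathrm{Output} - Y)\cdot XX
```

Since Output = w · XX (flattened), the second factor is just XX itself.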
Moving on to the second term.

Ooh... too much weird logic here? Stick with me, I'll help you understand this.
Line 1: basic chain rule. Line 2: the first and third terms follow the same lines as the calculations above. Again, spend a minute or work it out on paper to understand this.
Now, what the heck is this rotated w!? It took me a really long time to understand how this was calculated. For this, you need to go through these concepts in the mentioned order only.
- Transposed Convolution – Output is a 1*1 matrix and XX is a 1*4 matrix (because it was flattened here), right? So, when we backpropagate, we are increasing the size of the matrix. Transposed convolution helps with this. Fast-forward the video and just watch the part where they take the transposed convolution.
- Now take a deep breath, and read through this. This is the most important one for building the intuition for how the output changes with the Filter and X. I'm pasting the conclusion from the above article. JFYI, don't get confused by "full convolution" – it's nothing but transposed convolution (which you just understood above). A code sketch of both gradients follows this list.
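To make the rotated-filter step concrete, here is a minimal SciPy sketch that continues the Scenario 1 code above. The variable names (dE_dXX and so on) are my own, not notation from the linked articles:

```python
from scipy.signal import correlate2d, convolve2d

# Backprop through the fully connected layer first.
dE_dOutput = 2.0 * (Output - Y)                 # from the first term above
dE_dXX = (dE_dOutput * w).reshape(2, 2)         # un-flatten back to 2*2

# dE/dFilter: a "valid" cross-correlation of the input X with dE/dXX.
dE_dF = correlate2d(X, dE_dXX, mode='valid')    # 2*2, same shape as Filter

# dE/dX: a "full" convolution of dE/dXX with the filter. convolve2d flips
# the kernel 180 degrees, which is exactly the rotated-filter trick the
# articles above describe.
dE_dX = convolve2d(dE_dXX, F, mode='full')      # 3*3, same shape as X
```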

Conclusively, we can update the Filter and w values like this to decrease E.

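In other words, the standard gradient descent update, where η is a learning rate you choose:

```latex
w \leftarrow w - \eta\,\frac{\partial E}{\partial w},
\qquad
\mathrm{Filter} \leftarrow \mathrm{Filter} - \eta\,\frac{\partial E}{\partial \mathrm{Filter}}
```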
Scenario 2: 2 Convolution Layers + 1 Fully Connected Layer

Now, no matter how many convolution layers we add, our approach will remain the same. As usual, the goal would be:

The first 2 terms we have already calculated above. Let's see what the formula for the last term will be.

In case you need to deep-dive into this, I would recommend this article. Then calculate the new F1, F2, and w accordingly.
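Writing the outputs of the two convolution layers as XX1 and XX2 (my notation, not the article's), the chain for the earliest filter F1 looks like this; the middle factor ∂XX2/∂XX1 is again the full-convolution / rotated-filter step from Scenario 1:

```latex
\frac{\partial E}{\partial F_1}
  = \frac{\partial E}{\partial \mathrm{Output}}
    \cdot \frac{\partial \mathrm{Output}}{\partial XX_2}
    \cdot \frac{\partial XX_2}{\partial XX_1}
    \cdot \frac{\partial XX_1}{\partial F_1}
```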
Scenario 3: What about the Max Pooling layer?
Max pooling is an important concept in CNNs – so how does backpropagation work for it?

If you think about it, there are no learnable variables in the max pooling layer, unlike filters. So, we don't need to adjust any values here.
But it's affecting my previous layers, right?! It reduces the size of the matrix by combining a few entries into a single value. So yes, it does affect backpropagation: the entries that did not hold the maximum value won't receive any gradient.
So, what we are saying here is – all the positions that didn't have the maximum value will get 0 as their gradient. More depth here. A small sketch of this follows below.
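Here is a minimal NumPy sketch of that behaviour, assuming a 2*2 pool with stride 2 (the helper names are my own): the forward pass remembers where each maximum came from, and the backward pass routes the incoming gradient only to those positions.

```python
import numpy as np

def maxpool_forward(x, size=2):
    """2*2 max pooling with stride 2; also records where each max was."""
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    mask = np.zeros_like(x)                      # 1 at each max position
    for i in range(0, h, size):
        for j in range(0, w, size):
            patch = x[i:i+size, j:j+size]
            out[i // size, j // size] = patch.max()
            m, n = np.unravel_index(patch.argmax(), patch.shape)
            mask[i + m, j + n] = 1.0
    return out, mask

def maxpool_backward(dout, mask, size=2):
    """Route each upstream gradient to the max position; all other
    positions get gradient 0."""
    return np.repeat(np.repeat(dout, size, axis=0), size, axis=1) * mask

x = np.array([[1., 3., 2., 1.],
              [4., 2., 0., 5.],
              [6., 1., 2., 2.],
              [0., 7., 3., 4.]])
out, mask = maxpool_forward(x)
dx = maxpool_backward(np.ones_like(out), mask)
print(dx)   # 1s only where the maxes (4, 5, 7, 4) sat; 0 everywhere else
```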

I have tried to put all the good and relevant articles in one place, to help you see the overall picture of convolution. Please go through the above and let me know if something still stands missing in the overall flow – I'll be happy to edit the article to accommodate it.