Introduction
In the last article we saw how to do forward and backward propagation for convolution operations in CNNs. Applying a pooling layer after the convolution layer helps the network generalize better and reduces overfitting: given a grid of size pooling height x pooling width, we keep only one value from it, discarding the other elements and suppressing noise. Moreover, because pooling reduces the spatial dimensions of the feature maps coming from the previous layer and adds no learnable parameters, it decreases model complexity and computational cost and results in faster training.
Forward propagation
We assume that after the convolution operation we get an output of shape 4×4. We then want to do max pooling with pooling height, pooling width and stride all equal to 2. Pooling is similar to convolution, but instead of doing an element-wise multiplication between the weights and a region of the input and summing the results to get one element of the output matrix, we simply select the maximum element of that region. The following visualization will clarify:

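For instance, selecting the maximum of a single 2×2 region looks like this (a minimal standalone sketch with made-up numbers):
import numpy as np

region = np.array([[0.1, 0.3],
                   [0.2, 0.0]])  # one hypothetical 2x2 pooling region of the input
print(np.max(region))  # 0.3 is the only value kept for this region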
The output shape after the pooling operation is obtained using the following formulas:
H_out = floor(1 + (H - pool_height) / stride)
W_out = floor(1 + (W - pool_width) / stride)
where H is the height of the input, W is the width of the input, pool_height is the height of the pooling region and pool_width is the width of the pooling region.
In our example we get:
H_out = floor(1 + (4 - 2) / 2) = 2
W_out = floor(1 + (4 - 2) / 2) = 2
This is how it can be implemented in code:
import numpy as np

N = 1  # number of examples
C = 1  # number of channels
H = 4  # height of the input
W = 4  # width of the input
x_shape = (N, C, H, W)
x = np.linspace(-0.3, 0.4, num=np.prod(x_shape)).reshape(x_shape)
pool_param = {'pool_width': 2, 'pool_height': 2, 'stride': 2}
pool_height = pool_param['pool_height']
pool_width = pool_param['pool_width']
stride = pool_param['stride']
H_out = int(1 + (H - pool_height) / stride)
W_out = int(1 + (W - pool_width) / stride)
out = np.zeros((N, C, H_out, W_out))
for n in range(N):
    for c in range(C):
        for hi in range(H_out):
            for wi in range(W_out):
                # take the maximum over the (pool_height x pool_width) region
                out[n, c, hi, wi] = np.max(x[n, c, hi * stride : hi * stride + pool_height,
                                                   wi * stride : wi * stride + pool_width])
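As a quick sanity check (not part of the original walkthrough), the same output can be computed with a vectorized reshape; this shortcut works here only because the stride equals the pooling size:
# put each (pool_height x pool_width) window on its own axes, then take the max over those axes
out_vectorized = x.reshape(N, C, H_out, pool_height, W_out, pool_width).max(axis=(3, 5))
print(np.allclose(out, out_vectorized))  # expected: True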
Backpropagation
Differently from convolution operations, we do not have to compute weight and bias derivatives here, since a pooling operation has no parameters. Thus, the only derivative we need to compute is the one with respect to the input, ∂Y/∂X. We know that the derivative with respect to the input has the same shape as the input. Let's look at the first element of ∂Y/∂X, namely ∂Y/∂x₁₁.

It is clear that the derivative ∂Y/∂x₁₁ = ∂y₁₁/∂x₁₁ is different from zero only if x₁₁ is the maximum element of the first pooling region. Assuming instead that the maximum element of the first region is x₁₂, we have ∂y₁₁/∂x₁₂ = ∂x₁₂/∂x₁₂ = 1, and the derivatives with respect to all the other xᵢⱼ in the first pooling region are zero. Again, because we have an incoming derivative from the following layer, we need to multiply the local gradient by the incoming gradient, following the chain rule. Thus, assuming dy₁₁ is the incoming derivative, all the gradients for the first region are zero except the one at x₁₂, which is 1 * dy₁₁ = dy₁₁.

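To make this concrete, here is a minimal sketch for a single 2×2 region with made-up numbers: the incoming gradient dy is routed entirely to the position of the maximum.
region = np.array([[0.1, 0.3],
                   [0.2, 0.0]])  # hypothetical pooling region; the maximum is at (0, 1)
dy = 5.0                         # hypothetical incoming gradient for this region
dregion = np.zeros_like(region)
i_max, j_max = np.unravel_index(np.argmax(region), region.shape)
dregion[i_max, j_max] = dy       # only the max position receives the gradient
print(dregion)                   # [[0. 5.] [0. 0.]]
The full backward pass simply repeats this routing for every pooling region.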
In code:
dout = np.random.randn(N, C, H_out, W_out)  # incoming gradients from the following layer
dx = np.zeros_like(x)
for n in range(N):
    for c in range(C):
        for i in range(H_out):
            for j in range(W_out):
                region = x[n, c, i * stride : i * stride + pool_height,
                                 j * stride : j * stride + pool_width]
                # get the index inside region (i, j) where the value is the maximum
                i_t, j_t = np.where(region == np.max(region))
                i_t, j_t = i_t[0], j_t[0]
                # only the position of the maximum element in region (i, j) receives
                # the incoming gradient; the other gradients stay zero
                dx[n, c, i * stride : i * stride + pool_height,
                         j * stride : j * stride + pool_width][i_t, j_t] = dout[n, c, i, j]
Thus, assuming the following input and incoming derivative:


the output and the gradients with respect to the inputs would be:


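To verify that the backward pass is correct, a common trick (not shown in this walkthrough) is to compare dx with a numerical gradient computed by finite differences. Below is a minimal sketch; max_pool_forward is a hypothetical helper that just wraps the forward loop from above, and the check assumes there are no ties inside a pooling region:
# hypothetical helper wrapping the forward pass shown earlier, used only for this check
def max_pool_forward(x, pool_param):
    N, C, H, W = x.shape
    ph, pw, s = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
    H_out = 1 + (H - ph) // s
    W_out = 1 + (W - pw) // s
    out = np.zeros((N, C, H_out, W_out))
    for n in range(N):
        for c in range(C):
            for i in range(H_out):
                for j in range(W_out):
                    out[n, c, i, j] = np.max(x[n, c, i * s : i * s + ph, j * s : j * s + pw])
    return out

# finite-difference approximation of the gradient of sum(out * dout) with respect to x
eps = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[idx] += eps
    x_minus[idx] -= eps
    f_plus = np.sum(max_pool_forward(x_plus, pool_param) * dout)
    f_minus = np.sum(max_pool_forward(x_minus, pool_param) * dout)
    dx_num[idx] = (f_plus - f_minus) / (2 * eps)

print(np.max(np.abs(dx - dx_num)))  # expected to be very close to zero
If the two gradients agree up to numerical precision, the backward pass is routing the incoming gradients to the right positions.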
Conclusions
If you followed my previous article on forward and backward propagation for convolution operations, I am sure this article was a piece of cake for you! Pooling operations are very important for better model generalization, lower complexity and faster training. However, some argue that pooling can be harmful in certain cases, because downsampling the features may discard information that is critical for classifying an object correctly. Nevertheless, the advantages of a pooling layer are usually big enough to include one when building a CNN model.