What is wrong with convolutional neural networks?

Introduction

Convolutional neural networks (CNNs) are a fascinating and powerful tool, and arguably one of the reasons deep learning is so popular these days. Since Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton published “ImageNet Classification with Deep Convolutional Neural Networks” in 2012, CNNs have been the winning card in computer vision, achieving superhuman performance on several tasks. But are CNNs flawless? Is this the best we can do? I guess from the title you figured out that the answer is NO.

On December 4, 2014, Geoffrey Hinton gave a talk at MIT about a project of his called capsule networks. He discussed the problems with CNNs and argued that pooling is a big mistake, and that the fact that it works so well is a disaster.

If you are already familiar with CNNs, you can skip to “What is wrong?”.

Convolutional layers

A convolutional layer has a set of small matrices, called filters or kernels, that slide across the previous layer's output. At each position the filter is multiplied element-wise with the patch under it and the products are summed, a process called convolution. Each filter detects a feature, which can be basic (e.g. an edge, a color gradient, or a pattern) or complex (e.g. a shape, a nose, or a mouth).
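To make the sliding-window idea concrete, here is a minimal NumPy sketch. The function name `conv2d`, the toy 4x4 image, and the hand-written edge filter are all made up for illustration; in a real CNN the filter values are learned, and frameworks use much faster implementations.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    At each position, multiply the kernel element-wise with the
    patch under it and sum the products.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy image (assumption): dark left half, bright right half.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hand-written vertical-edge filter (assumption, not a learned one).
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

feature_map = conv2d(image, kernel)
print(feature_map)  # responds only in the column where dark meets bright
```

The feature map is non-zero only where the window straddles the dark-to-bright boundary, which is what "detecting an edge" means here.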

(source)

Pooling layers

There is more than one type of pooling layer (max pooling, average pooling, …). The most common these days is max pooling, because it gives some translational invariance (limited, but good enough for some tasks) and it reduces the dimensionality of the network cheaply, with no parameters.
Max pooling is actually very simple: you predefine a filter (a window) and slide this window across the input, taking the max of the values inside the window as the output.
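As a rough sketch of that procedure, the snippet below applies a 2x2 window with stride 2 to a small array. The function name `max_pool2d` and the input values are invented for this example.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Slide a size x size window over x and keep the max of each window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

# Toy 4x4 input (assumption: arbitrary values for illustration).
x = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
], dtype=float)

pooled = max_pool2d(x)
print(pooled)
# [[6. 5.]
#  [7. 9.]]
```

Sixteen values go in and four come out, with no parameters to learn. Note that the output no longer says where inside each window the maximum was.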

Max pooling with filter size 2×2 (source)
(source)

What is wrong?

1- Backpropagation

Backpropagation is a method for finding the contribution of every weight to the error after a batch of data has been processed. Most good optimization algorithms (SGD, Adam, …) use backpropagation to compute the gradients.
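A tiny worked example may help here. The sketch below uses a made-up two-weight model (`y = w2 * (w1 * x)`) with a squared-error loss, applies the chain rule by hand to get each weight's contribution, and then takes one plain gradient-descent step. All names and numbers are assumptions chosen for illustration; real networks do the same thing over millions of weights.

```python
def forward(w1, w2, x, t):
    """Two chained multiplications followed by a squared-error loss."""
    h = w1 * x
    y = w2 * h
    loss = 0.5 * (y - t) ** 2
    return h, y, loss

def backward(w1, w2, x, t):
    """Chain rule, applied from the loss back toward the first weight."""
    h, y, _ = forward(w1, w2, x, t)
    dloss_dy = y - t              # d(loss)/dy
    grad_w2 = dloss_dy * h        # y = w2 * h  ->  dy/dw2 = h
    dloss_dh = dloss_dy * w2      # propagate the error back through w2
    grad_w1 = dloss_dh * x        # h = w1 * x  ->  dh/dw1 = x
    return grad_w1, grad_w2

# Toy values (assumption).
w1, w2, x, t = 0.5, -1.5, 2.0, 1.0
_, _, loss_before = forward(w1, w2, x, t)
grad_w1, grad_w2 = backward(w1, w2, x, t)

# One step of plain gradient descent with a hypothetical learning rate.
lr = 0.1
w1_new = w1 - lr * grad_w1
w2_new = w2 - lr * grad_w2
_, _, loss_after = forward(w1_new, w2_new, x, t)

print(grad_w1, grad_w2)         # 7.5 -2.5
print(loss_before, loss_after)  # the loss drops after the update
```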

Backpropagation has worked very well in recent years, but it is not an efficient way of learning, because it needs a huge dataset.
I believe we can do better.

2- Translation invariance

Translational invariance means that the output stays the same when an object changes position in the image. CNNs achieve this only to a limited degree: the same object with a slight change of orientation or position might not fire the neuron that is supposed to recognize it.

(source)

As in the image above, if we assume there is a neuron that is supposed to detect cats, its value will change as the position and rotation of the cat change. Data augmentation partially solves the problem, but it does not get rid of it entirely.
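The effect can be imitated with a deliberately simplified stand-in for such a neuron. In the sketch below the "neuron" is just a weighted sum whose weights match one small pattern at one location; the pattern, the helper `place`, and the `neuron` function are all hypothetical, and a real CNN neuron sits on top of many convolution and pooling layers.

```python
import numpy as np

# Toy 3x3 pattern standing in for "the object" (assumption).
pattern = np.array([
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
], dtype=float)

def place(p, top, left, size=6):
    """Paste pattern p onto an empty size x size canvas."""
    canvas = np.zeros((size, size))
    h, w = p.shape
    canvas[top:top + h, left:left + w] = p
    return canvas

# Hypothetical detector: weights tuned to the pattern at position (1, 1).
weights = place(pattern, 1, 1)

def neuron(image):
    """Weighted sum of the pixels, i.e. the neuron's raw activation."""
    return float(np.sum(image * weights))

original = place(pattern, 1, 1)
shifted = place(pattern, 1, 3)             # same object, moved 2 pixels right
rotated = place(np.rot90(pattern), 1, 1)   # same object, rotated 90 degrees

print(neuron(original), neuron(shifted), neuron(rotated))
```

The activation is highest for the exact pose the weights were tuned to and drops for the shifted and rotated copies, even though the object is the same. Data augmentation works around this by showing the network many poses during training.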

3- Pooling layers

Pooling layers are a big mistake because they lose a lot of valuable information and ignore the relation between the parts and the whole. Take a face detector: it has to combine several features (a mouth, two eyes, a face oval, and a nose) to say that something is a face.
A CNN would say that if those 5 features are present with high probability, this is a face, regardless of how they are arranged.

(source)

So the outputs for the two images might be similar, which is not good.
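Here is a small sketch of why that can happen. It uses an extreme case, a single global max pool over one made-up feature map per face part, so the helper names, part positions, and map size are assumptions; real CNNs pool locally in several stages, which loses position more gradually.

```python
import numpy as np

parts = ["left eye", "right eye", "nose", "mouth", "face oval"]

def make_maps(positions, size=4):
    """One size x size feature map per part, with a 1 where the part fires."""
    maps = np.zeros((len(positions), size, size))
    for k, (row, col) in enumerate(positions):
        maps[k, row, col] = 1.0
    return maps

def global_max_pool(feature_maps):
    """Keep only the strongest response of each feature map."""
    return feature_maps.max(axis=(1, 2))

# Parts laid out roughly like a face (toy coordinates).
face = make_maps([(0, 0), (0, 3), (1, 1), (3, 1), (2, 2)])
# The same parts scattered at unrelated positions.
scrambled = make_maps([(3, 0), (2, 3), (0, 2), (0, 0), (1, 3)])

pooled_face = global_max_pool(face)
pooled_scrambled = global_max_pool(scrambled)

print(dict(zip(parts, pooled_face)))
print(np.array_equal(pooled_face, pooled_scrambled))  # True
```

After pooling, both inputs are described by the same vector: every part is present, and nothing records where. A classifier reading that vector cannot tell the face from the scrambled version.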

Conclusion

CNNs are awesome, but they have two very dangerous flaws: poor translation invariance and pooling layers. Luckily, we can reduce the danger with data augmentation, but something new is coming (capsule networks), and we have to be ready and open to the change.

sources