Source: Unsplash

We Need to Rethink Convolutional Neural Networks

(They kind of suck)

Andre Ye
Towards Data Science
6 min read · Sep 18, 2020


Convolutional Neural Networks (CNNs) have shown impressive state-of-the-art performance on many standard datasets, and they have undoubtedly been instrumental in accelerating research and development in image processing.

There’s one problem: they kind of suck.

Researchers often get too wrapped up in the closed world of theory and perfect datasets. Unfortunately, chasing extra fractions of a percentage point of accuracy is counterproductive to where image processing is actually used: the real world. Algorithms and methods designed with the noiseless, perfectly predictable world of a benchmark dataset in mind may well perform poorly once they leave it.

This has certainly proven to be the case. Convolutional neural networks are especially prone to ‘adversarial’ inputs: small changes to the input that, whether intentional or not, confuse the network.

In 2020, the cybersecurity firm McAfee showed that Mobileye, the camera-based driver-assistance system used by Tesla and other auto manufacturers, could be fooled into accelerating 50 MPH over the speed limit simply by plastering a two-inch-wide strip of black tape onto a speed limit sign.

Researchers from four universities, including the University of Washington and UC Berkeley, found that road-sign recognition models were completely fooled by a bit of spray paint or a few stickers on a stop sign, alterations inconspicuous enough to pass for ordinary, non-malicious graffiti.

Even more importantly, convolutional neural networks are bad at generalizing across rotations and changes of scale in images, not to mention the different viewing angles of three-dimensional objects.

Source: Evtimov et al. Image free to share.

To understand why convolutional neural networks have so much trouble generalizing across perspectives and angles, one must first understand why a convolutional neural network works at all — and what is so special about convolutional and pooling layers.

Since convolutional layers apply the same filter across the entire image, and that filter can be thought of as a ‘feature detector’ that looks for lines or other patterns, detection is effectively translation invariant: it is not affected by where the object appears. Whether an object sits in the top left or the bottom right, it will be detected, because the filter is slid over every position. Pooling then ‘summarizes’ the findings within each region, smoothing the response further. With convolutional and pooling layers together, objects can still be detected in different regions, and even with slight tilting and scaling.

Source: HackerNoon. Image free to share.
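To make this concrete, here is a minimal sketch (assuming PyTorch; the toy image and hand-made edge filter are purely illustrative) of why convolution copes with translation but not rotation: the filter’s response simply moves with the object, and pooling collapses that movement away.

```python
# A minimal sketch of translation equivariance vs. rotation sensitivity.
import torch
import torch.nn.functional as F

# A 3x3 "feature detector": a simple vertical-edge filter (illustrative).
filt = torch.tensor([[-1., 0., 1.],
                     [-1., 0., 1.],
                     [-1., 0., 1.]]).view(1, 1, 3, 3)

# A toy 12x12 image containing a small vertical bar ("the object").
img = torch.zeros(1, 1, 12, 12)
img[0, 0, 2:6, 3] = 1.0

# The same object shifted toward the bottom-right.
shifted = torch.roll(img, shifts=(4, 5), dims=(2, 3))

# Convolution is translation-equivariant: the activation map moves with the object...
act = F.conv2d(img, filt, padding=1)
act_shifted = F.conv2d(shifted, filt, padding=1)

# ...and pooling the whole map makes the final response translation-invariant.
print(act.max().item(), act_shifted.max().item())   # identical peak activations

# Rotate the object, however, and the fixed filter no longer matches it:
# the peak response changes (here a 90-degree rotation for simplicity).
rotated = torch.rot90(img, k=1, dims=(2, 3))
print(F.conv2d(rotated, filt, padding=1).max().item())  # different, lower peak
```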

Filters, on the other hand, cannot capture changes in scale. The red box represents a filter that produces a high activation when it passes over a bird. In a scaled-up image, no position of the filter produces a high activation, because the filter’s receptive field is fixed in size.

Source: Pixabay

The same applies to rotation. A filter is simply a matrix of weights that produces a high value when the pixels beneath it form a particular spatial pattern. Because its weights are fixed and it only slides across the image top-to-bottom and left-to-right, it cannot recognize a pattern that has been rotated away from the orientation it was trained on.

Source: Pixabay

The standard way to deal with this is data augmentation, but that isn’t a great solution either. The convolutional neural network simply memorizes that an object can also appear at roughly that orientation and size, rather than genuinely generalizing to all viewpoints, and it is practically infeasible to expose the network to every viewpoint of every object.
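For reference, this is roughly what that standard workaround looks like in code, a minimal sketch assuming torchvision; the specific ranges for rotation, shift, and scale below are illustrative, not canonical.

```python
# Augment training images with random rotations, shifts, and rescalings so the
# network at least sees those variations, even though it memorizes them rather
# than truly generalizing.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=30,            # random rotation up to ±30°
                            translate=(0.1, 0.1),  # random shift up to 10%
                            scale=(0.8, 1.2)),     # random rescaling
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Applied on the fly to each training image, e.g.:
# dataset = torchvision.datasets.CIFAR10(root="data", train=True, transform=augment)
```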

Another method is to use higher-dimensional feature maps, replicating detectors across orientations and scales, which is of course incredibly inefficient.

Geoff Hinton describes CNNs as trying to model invariance: neural activities are pooled, or smoothed, so that they are unaffected by small changes. “This is,” he writes, “the wrong goal. It is motivated by the fact that the final label needs to be viewpoint-invariant.” Instead, he argues for equivariance, in which neural activities change as the viewpoint changes; it is the weights, not the neural activities, that should code the viewpoint-invariant knowledge of a shape.
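To make the distinction concrete: writing x for an input, T for a transformation of it (a shift, rotation, or viewpoint change), and f for the network’s feature mapping, the two properties are usually written as follows.

```latex
% Invariance: transforming the input leaves the output unchanged.
f(T(x)) = f(x)

% Equivariance: transforming the input transforms the output in a
% corresponding, predictable way (T' acts on the feature space).
f(T(x)) = T'(f(x))
```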

Additionally, CNNs parse an image as a single whole rather than as a composition of objects. There is no explicit representation of distinct entities and the relationships between them, which means the network is not robust to objects it has not seen before. It also means the network takes a brute-force approach to recognition, memorizing ever richer and more detailed pixel-level representations of an image instead of reasoning about its parts (e.g. thin tires + frame + handlebars = bike).

Much of this is because convolutional neural networks do not recognize images the way humans do. Yes, in the controlled environment of standard datasets, or on simple tasks where rotation and shift are uncommon or unimportant, CNNs perform well. But as we demand more from image processing, we need an update.

One approach to the invariance problem is the Spatial Transformer, a learnable module that predicts a geometric transformation (such as an affine warp) and applies it to the input before classification. This can correct imbalances of scale (first row) and rotation (second row), and, through an attention-like crop, remove surrounding noise (third row).

Source: Spatial Transformer Networks. Image free to share.

In fact, it can undo complex distortions, which is incredibly valuable given that real three-dimensional viewpoint changes go well beyond simple rotations and rescalings.

Source: Spatial Transformer Networks. Image free to share.
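Here is a minimal sketch (assuming PyTorch) of the core Spatial Transformer idea; the layer sizes are illustrative, assume inputs of roughly MNIST size or larger, and are not the configuration from the paper.

```python
# A small "localization" network regresses an affine transform, and the input
# is warped by that transform before being passed to an ordinary classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self, in_channels=1):
        super().__init__()
        # Localization network: looks at the image, outputs 6 affine parameters.
        self.localization = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 6),
        )
        # Start at the identity transform so training begins by "doing nothing".
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.localization(x).view(-1, 2, 3)          # per-image affine matrix
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)   # warped ("straightened") input

# Usage: straighten the input, then feed it to any ordinary CNN classifier, e.g.
# logits = cnn_classifier(SpatialTransformer(3)(images))
```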

Several other architectures, such as the Scale-invariant Convolutional Neural Network (SiCNN), have recently been proposed.

More famously, Geoff Hinton proposes the capsule network, which explicitly builds in the idea of recognizing individual parts and composing them hierarchically; this, he argues, is the natural, human way of recognizing objects.

Hinton points out that the task of computer vision is really inverse computer graphics. Graphics programs use hierarchical models in which the spatial relationship between a part and the whole it belongs to is stored as a matrix that does not depend on viewpoint; rendering the scene from a different viewpoint is then just a matrix multiplication.

Thus, it should be the goal of an image recognition network to find the connection between a viewpoint representation and the ‘intrinsic’ object representation, which is the same regardless of viewpoint.
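A minimal sketch (NumPy, with illustrative part names and numbers of my own choosing) of that “inverse graphics” idea: the pose of each part relative to the whole is a fixed matrix, and a change of viewpoint is just one extra matrix multiplied in front of everything.

```python
import numpy as np

def pose(angle_deg, tx, ty):
    """A 2D rigid pose as a 3x3 homogeneous matrix (rotation + translation)."""
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), tx],
                     [np.sin(a),  np.cos(a), ty],
                     [0.0,        0.0,       1.0]])

# Intrinsic, viewpoint-invariant knowledge: where the parts sit inside the object.
face_to_nose = pose(0, 0.0, -0.1)   # nose slightly below the face centre
face_to_eye  = pose(0, -0.3, 0.2)   # eye up and to the left

# The viewpoint: the whole object as seen by the camera, rotated and shifted.
camera_to_face = pose(30, 2.0, 1.0)

# Part poses in the camera frame come purely from matrix multiplication;
# the part-to-whole matrices themselves never change with viewpoint.
camera_to_nose = camera_to_face @ face_to_nose
camera_to_eye  = camera_to_face @ face_to_eye
print(camera_to_nose)
```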

Each capsule is dedicated to one of these intrinsic entities and recognizes it regardless of the angle it is viewed from, because the model is forced to learn how the entity’s features vary with viewpoint. This leads to better extrapolation, meaning image models begin to genuinely generalize across camera viewpoints.

Capsules encode spatial information and pass it upward through ‘routing by agreement’: a lower-level capsule representing, say, an eye, a nose, or a pair of lips only sends its output to a higher-level capsule (a face) when the poses they predict for that higher-level entity agree.
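Finally, a minimal sketch (assuming PyTorch) of what routing by agreement looks like as an algorithm, following the dynamic-routing procedure from the capsule work of Sabour et al. (2017); the tensor shapes and iteration count here are illustrative.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Shrink a vector's length into (0, 1) while preserving its direction."""
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def routing_by_agreement(u_hat, num_iters=3):
    """u_hat: predictions from lower capsules for each higher capsule,
    shape (batch, n_lower, n_higher, dim_higher)."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)    # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                   # how much each lower capsule sends where
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)  # weighted sum per higher capsule
        v = squash(s)                             # higher-capsule outputs
        # Agreement: lower capsules whose predictions match v get a louder voice.
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v

# Example: 8 lower capsules voting for 3 higher capsules of dimension 16.
votes = torch.randn(4, 8, 3, 16)
print(routing_by_agreement(votes).shape)   # torch.Size([4, 3, 16])
```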

Obviously, this is a completely different paradigm from convolutional neural networks. Perhaps, however, this shift in how we approach image recognition is exactly what is needed to move past the days of designing for datasets and toward more intelligent, robust models that hold up on increasingly complex and in-demand real-world tasks.
