
CNN: Can a computer really see?

Convolutional Neural Network explained!

A Convolutional Neural Network (CNN or ConvNet) is a class of deep learning model, inspired by the human visual system and specialized in extracting high-level features from complex inputs such as images, text, and audio.

Photo by NeONBRAND on Unsplash

In fact, unlike classical neural networks, where features are manually engineered, a CNN takes the inputs’ raw data, automatically distinguishes meaningful patterns, and learns based on those features instead of raw pixels. This is what makes the CNN the most frequently used neural network in image analysis [1].

But how can a computer analyze images?

Basically, the convolutional neural network is composed mainly of:

  • Input layer
  • Convolutional layers
  • Activation function Layers
  • Pooling Layers
  • Flatten and Fully-connected Layer

Let’s start with the input layer of the CNN to understand how a computer sees an image.

Input Layer:

An image is usually represented as a three-dimensional (3D) matrix of pixel values:

  • Height and width depend on the input image’s dimensions.
  • Depth is generally three channels, one for each RGB (Red-Green-Blue) color component of a pixel.
Image by author: Example of an image representation
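To make this concrete, here is a minimal NumPy sketch of such a 3D pixel matrix; the 9×9 image size and random pixel values are illustrative assumptions, not part of the article:

```python
import numpy as np

# A hypothetical 9x9 RGB image: height x width x depth (3 color channels)
image = np.random.randint(0, 256, size=(9, 9, 3), dtype=np.uint8)

print(image.shape)  # (9, 9, 3)
print(image[0, 0])  # the RGB values of the top-left pixel, each in [0, 255]
```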

Convolutional Layer:

The convolution operation is the basic component of a CNN, introduced by Y. Bengio and Yann LeCun [2]. It consists of applying a filter to the input image to detect the features related to each class.

Image by author: Red channel of 9 pixels image

The filter (fixed size) is composed of randomly initialized weights that will be updated through the back-propagation with each input.

The filter slides both vertically and horizontally across the image with a stride (the number of pixels skipped in the convolution), multiplying the image’s pixel values by the values of the filter. All these products are then summed into a single number that forms one pixel of the feature map.

Image by author: Horizontal slide operation
Image by author: Vertical slide operation

At the end of the convolution process, the feature map obtained represents a smaller matrix containing the detected patterns of the input image.
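The sliding operation described above can be sketched as a minimal NumPy implementation of a 2D convolution on a single channel (no padding); the function name and the random 3×3 filter are illustrative assumptions:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; at each position, multiply
    element-wise and sum into one pixel of the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(81, dtype=float).reshape(9, 9)  # one 9x9 channel
kernel = np.random.randn(3, 3)                    # randomly initialized weights
fmap = convolve2d(image, kernel, stride=1)
print(fmap.shape)  # (7, 7): the feature map is smaller than the input
```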

Obviously, the more filters we apply on the image, the more features are extracted and the better the network becomes at detecting patterns in images.

Finally, the convolution layer is composed of multiple filters of the same size that output various feature maps from an input of a 3D channel representation of an image. It is summarized in the figure below:

Image by author: Convolutional Layer of an input image of 9×9 pixels with stride = 1 and filter of size 3×3

Non-Linearity with ReLU:

Since convolution is a linear operation (multiplication and addition) while real-world data is non-linear, we must introduce an activation function layer after each convolution layer.

Indeed, by increasing the non-linearity of the network, we obtain a model complex enough to detect and distinguish the numerous patterns of the input image.

The most commonly used activation function for this task is ReLU (Rectified Linear Unit) [3]. Mathematically, it is defined as:

ReLU(x) = max(0, x)

Thanks to its mathematical properties, it is an element-wise operation (applied per pixel in our case) that replaces the negative pixel values in the feature map with zeroes.
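A minimal sketch of this element-wise operation in NumPy (the sample feature map values are made up for illustration):

```python
import numpy as np

def relu(x):
    # Element-wise: negative values become 0, positives pass through
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 1.5],
                        [ 0.5, -3.0]])
print(relu(feature_map))
# [[0.  1.5]
#  [0.5 0. ]]
```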

Pooling Layer:

Similar to the convolutional layer, the pooling layer further reduces the dimensions of each of the previous matrices independently (dimensionality reduction), significantly decreasing the number of parameters and the computational power required to process the data.

Furthermore, this operation extracts the dominant features of the input while keeping training effective, as the network becomes invariant to small transformations and translations of the input image.

Image by author: Pooling Layer for an input image of 7×7 pixels with stride = 2 and filter of size 3×3 using the Max pooling approach

Pooling can be accomplished in multiple approaches: Max pooling, Min pooling, Mean pooling and Average pooling. The most common approach used is Max pooling.
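The Max pooling approach can be sketched in a few lines of NumPy; the 3×3 window and stride of 2 below mirror the figure’s example and are otherwise arbitrary:

```python
import numpy as np

def max_pool(x, size=3, stride=2):
    """Max pooling: keep only the largest value in each window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.arange(49, dtype=float).reshape(7, 7)  # a 7x7 feature map
print(max_pool(x).shape)  # (3, 3): a much smaller matrix
```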

Flatten and Fully-connected Layer:

The combination of convolutional and pooling layers extracts the dominant features of the input image into a number of matrices; the flatten operation then converts them into a one-dimensional array, creating a single long feature vector suitable as input to the fully-connected layer.

The fully-connected layer accomplishes the classification task: it is a multi-layer perceptron, most often with a softmax activation function at its output. The network is then able to learn non-linear combinations of the high-level features and distinguish the specificities of each image.

Image by author: Example of a Flatten and Fully-connected Layer
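The flatten and fully-connected steps can be sketched as follows; the four 3×3 feature maps, the three classes, and the random weights are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

# Hypothetical pooled output: 4 feature maps of 3x3 each
feature_maps = np.random.randn(4, 3, 3)

# Flatten into a single long feature vector of length 4*3*3 = 36
features = feature_maps.flatten()

# One fully-connected layer mapping 36 features to 3 hypothetical classes
W = np.random.randn(3, 36)  # weights, learned by back-propagation in practice
b = np.zeros(3)             # biases
probs = softmax(W @ features + b)

print(probs.shape)  # (3,): one probability per class
print(probs.sum())  # sums to (approximately) 1
```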

To summarize, the CNN architecture performs two major tasks:

  • Feature extraction: convolutional layers + pooling layers
  • Classification: fully-connected layer

The image represents the full architecture of a CNN:

Image by author: Summary of the CNN architecture for the example of an input of 9×9 pixels

In general, the more convolutional layers we have, the more features the model will be able to recognize.

References:

[1] Maria Valueva, Nikolay Nagornov, Pavel Lyakhov, G. V. Valuev, and N. I. Chervyakov. Application of the residue number system to reduce hardware costs of the convolutional neural network implementation. Mathematics and Computers in Simulation, 177, 05 2020.

[2] Y. Bengio and Yann LeCun. Convolutional networks for images, speech, and time-series. 11 1997.

[3] Hidenori Ide and Takio Kurita. Improvement of learning for CNN with ReLU activation by sparse regularization. IEEE, 07 2017.

