"From a computer vision point of view, there’s no doubt that deep convolutional neural networks are today’s "master algorithm" for dealing with perceptual data."
- Tomasz Malisiewicz
Most of us have used effects and filters on images and have seen how our computers and smartphones detect and recognize faces in photographs and videos. All of this is made possible by "computer vision", which today relies largely on machine learning with convolutional neural networks.
Computer vision is similar to human vision: it helps a system recognize, classify, and detect complex features in data. Its applications include self-driving cars, robot vision, and facial recognition.
However, computer vision is not quite the same as human vision. Unlike us, a computer sees an image as a matrix of pixels.
An image is made up of pixels, and in a typical 8-bit image each pixel takes a value from 0 to 255.
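As a rough illustration of the statement above, a grayscale image can be held in a NumPy array of unsigned 8-bit integers. The pixel values below are made up for demonstration and are not taken from any dataset.

```python
import numpy as np

# A hypothetical 3x3 grayscale "image": 0 is black, 255 is white.
image = np.array([[0, 128, 255],
                  [64, 192, 32],
                  [255, 0, 100]], dtype=np.uint8)

print(image.shape)               # (rows, columns)
print(image.dtype)               # uint8 holds exactly the values 0..255
print(image.min(), image.max())  # darkest and brightest pixel
```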
What is a convolutional neural network
A convolutional neural network, or CNN, is a kind of neural network used to process data whose input has the shape of a 2D matrix, such as images.
A convolutional neural network is a feed-forward network with several hidden layers in sequence, mainly convolution layers (each followed by an activation function) and pooling layers. With such a CNN model, we can recognize handwritten letters and human faces (depending on the number of layers and the complexity of the image).

In this article, we will learn the concepts behind CNNs and build an image classifier model for a better grasp of the subject.
Before building the model, we need to understand a few important concepts of convolutional neural networks.
- As we already know, computers view images as numbers in the form of a matrix of pixels. A CNN views images as three-dimensional objects where height and width are the first two dimensions and color encoding is the third dimension (for example, a 3x3 RGB image has the shape 3x3x3).
Now just imagine how computationally intensive it is to process a 4K image (3840 x 2160 pixels).
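To put a number on this, here is a small back-of-the-envelope calculation. The dense layer with 100 nodes is a hypothetical comparison point chosen for illustration; it is not part of the model built later.

```python
# Number of values in one 4K RGB image (3840 x 2160 pixels, 3 color channels).
width, height, channels = 3840, 2160, 3

pixels = width * height
values = pixels * channels

# Hypothetical: weights needed to connect every input value
# directly to a dense layer of 100 nodes (biases ignored).
dense_nodes = 100
dense_weights = values * dense_nodes

print(f"pixels: {pixels:,}")
print(f"values: {values:,}")
print(f"weights for a 100-node dense layer: {dense_weights:,}")
```

This is why a CNN first shrinks the image with convolution and pooling instead of feeding raw pixels to a fully connected layer.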
Convolution
- The main objective of convolutional networks is to reduce images to a form that is easier to process while preserving their features and maintaining good prediction accuracy.
A convolution step involves three main elements: the input image, the feature detector, and the feature map.
- A feature detector is a kernel or filter (a matrix of numbers, usually 3×3). The idea is to slide the kernel over the matrix representation of the image, multiply the two element-wise at each position, and sum the products to get one entry of the feature map. In this step, the size of the image is reduced for faster and simpler processing. Important features of the image are retained (features that are unique to the image/object, i.e. necessary for recognition). However, some information is lost in this step.

- For example, if we have an input image of 5x5x1 dimensions and the convolution kernel/filter we apply to an image is of 3x3x1 dimension:
Image matrix:
1 1 0 1 1
1 0 1 0 1
1 1 1 1 0
0 0 1 1 0
1 1 0 0 0
Kernel matrix:
1 0 1
0 1 0
1 1 0
Then the convolved feature, obtained by multiplying the kernel element-wise with each 3x3 window of the image matrix and summing the products, will be:
Convolved matrix:
3 5 3
3 2 5
4 4 2
Here the kernel is applied at 9 positions (3 across and 3 down) because the stride length is 1, i.e. the filter slides one element at a time over the image matrix.
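The worked example above can be reproduced with a few lines of NumPy. This is a minimal sketch for illustration, not how Keras implements convolution internally; the helper name convolve2d_valid is made up here. It assumes no padding and, as in CNN layers, does not flip the kernel.

```python
import numpy as np

image = np.array([[1, 1, 0, 1, 1],
                  [1, 0, 1, 0, 1],
                  [1, 1, 1, 1, 0],
                  [0, 0, 1, 1, 0],
                  [1, 1, 0, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])


def convolve2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and sum the
    element-wise products at each position."""
    k_rows, k_cols = kernel.shape
    out_rows = (image.shape[0] - k_rows) // stride + 1
    out_cols = (image.shape[1] - k_cols) // stride + 1
    out = np.zeros((out_rows, out_cols), dtype=image.dtype)
    for r in range(out_rows):
        for c in range(out_cols):
            top, left = r * stride, c * stride
            window = image[top:top + k_rows, left:left + k_cols]
            out[r, c] = np.sum(window * kernel)
    return out


feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[3 5 3]
#  [3 2 5]
#  [4 4 2]]
```

The output size follows from (5 - 3) / 1 + 1 = 3 positions in each direction, which gives the 9 kernel positions mentioned above.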
ReLU activation function
The purpose of applying the ReLU function (rectified linear unit) is to increase nonlinearity in the model. ReLU replaces every negative value with zero and leaves positive values unchanged. Since the features of an image/object are not linearly related to each other, we apply this function so that our model does not treat image classification as a linear problem.
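As a small sketch of what this function does, the snippet below applies ReLU to a made-up feature map that contains negative values.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: negative values become 0, the rest pass through."""
    return np.maximum(0, x)


feature_map = np.array([[-2.0, 0.5],
                        [3.0, -0.1]])  # made-up values for illustration
activated = relu(feature_map)
print(activated)
```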
Pooling layer
Like the convolutional layer, the pooling layer slides a window over its input. It is responsible for reducing the size of the convolved matrix.

It is an important step in a convolutional neural network. Pooling helps to detect and extract prominent features from images regardless of differences in position, angle, lighting, etc., while maintaining the accuracy and efficiency of the training model.
Furthermore, as the size of the image data is reduced (while preserving the dominant features), the computational power required to process the data is also decreased.
There are different types of pooling: max pooling, min pooling, and average pooling.
- Max pooling extracts the maximum value from the portion of the feature map matrix covered by the kernel (specific pool size like 2×2).
- Min pooling extracts the minimum value from the portion of the feature map matrix covered by the kernel (specific pool size like 2×2).
- Average pooling takes the average of all values in the portion of the feature map matrix covered by the kernel (specific pool size like 2×2).
Max pooling is the most commonly used of these pooling methods (since it keeps the most dominant features of the convolutional feature map).
Convolved matrix:
3 5 4 1
2 2 5 6
4 4 2 5
1 3 5 4
Max pooled matrix:
5 6
4 5
Min pooled matrix:
2 1
1 2
Average pooled matrix:
3 4
3 4
The matrices above are the pooled feature maps for a 2×2 pool size.
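The three pooled matrices can be checked with a short NumPy sketch. The helper name pool2d is made up for this illustration, and it assumes non-overlapping windows (stride equal to the pool size) and an input whose sides are multiples of the pool size.

```python
import numpy as np

feature_map = np.array([[3, 5, 4, 1],
                        [2, 2, 5, 6],
                        [4, 4, 2, 5],
                        [1, 3, 5, 4]])


def pool2d(feature_map, pool_size=2, reducer=np.max):
    """Split the map into non-overlapping pool_size x pool_size blocks
    and reduce each block to a single value."""
    rows, cols = feature_map.shape
    # Assumes rows and cols are multiples of pool_size.
    blocks = feature_map.reshape(rows // pool_size, pool_size,
                                 cols // pool_size, pool_size)
    return reducer(blocks, axis=(1, 3))


max_pooled = pool2d(feature_map, reducer=np.max)
min_pooled = pool2d(feature_map, reducer=np.min)
avg_pooled = pool2d(feature_map, reducer=np.mean)

print(max_pooled)  # [[5 6] [4 5]]
print(min_pooled)  # [[2 1] [1 2]]
print(avg_pooled)  # [[3. 4.] [3. 4.]]
```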
The number of convolution and pooling layers can be increased or decreased depending on the complexity of the input image and the level of detail that has to be extracted. But remember that the more layers you add to the model, the more computational power it requires.
With these convolution and pooling layers, our model can extract the features of the image.
Flattening
The next step is to flatten the pooled feature map, i.e. to transform the multidimensional pooled feature map into a one-dimensional array (a single vector) that can be fed to the neural network for processing and classification.
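In NumPy terms, flattening is just a reshape. The sketch below uses a made-up stack of pooled feature maps (2 rows, 2 columns, 3 filters) to show that the values stay the same and only the shape changes.

```python
import numpy as np

# Made-up pooled feature maps: 2 rows x 2 columns x 3 filters.
pooled = np.arange(12).reshape(2, 2, 3)

flat = pooled.reshape(-1)  # same values, now a one-dimensional vector
print(pooled.shape, "->", flat.shape)
```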
Fully connected layer – Classification
After we have obtained our data in the form of a column vector, we pass it through a feed-forward neural network. During training, backpropagation is applied at every iteration to adjust the weights, which improves the accuracy of the predictions.
After several epochs of training, our model will be able to recognize and distinguish between prominent and low-level features of the image.
The raw output values of the neural network do not necessarily lie between zero and one or sum to one. The softmax function (the activation function used for multi-class classification) rescales them so that each value lies between zero and one and all values sum to one. Each value then represents the probability of a class, and the class with the highest probability is taken as the output.
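A minimal NumPy sketch of softmax is shown below. The raw scores are made up for three hypothetical classes, and subtracting the maximum is a common numerical-stability trick that does not change the result.

```python
import numpy as np


def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)


scores = np.array([2.0, 1.0, 0.1])  # made-up raw outputs for three classes
probs = softmax(scores)

print(probs)
print("predicted class:", int(np.argmax(probs)))
```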
Implementation of CNN using MNIST dataset
In this article, we will be using the MNIST dataset, i.e. a dataset of 70,000 (60,000 training images and 10,000 test images) small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9.
Here the objective of our model is to classify a given image of a handwritten digit into one of 10 classes (representing the integers from 0 to 9).
We will be using the Keras and matplotlib libraries in this article.
The code below loads the MNIST dataset using the Keras API and plots the first nine training images using the matplotlib library.
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
from keras.optimizers import SGD
from keras.datasets import mnist
from matplotlib import pyplot
# load dataset
(trainX, trainy), (testX, testy) = mnist.load_data()
# plot first 9 images in a 3x3 grid
for i in range(9):
    pyplot.subplot(330 + 1 + i)
    pyplot.imshow(trainX[i], cmap=pyplot.get_cmap('gray'))
pyplot.show()
The dataset comes already split into training and testing images, which are loaded separately as shown in the code above.

Now we will load the complete dataset and pre-process the data before feeding it to the neural network.
(trainX, trainY), (testX, testY) = mnist.load_data()
trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
testX = testX.reshape((testX.shape[0], 28, 28, 1))
trainY = to_categorical(trainY)
testY = to_categorical(testY)
In the above code, we have reshaped the data to have a single color channel (since all the images are 28×28 pixels and grayscale).
Further, we have one-hot encoded the labels (using to_categorical, a Keras function) because we know there are ten distinct classes, each represented by a unique integer. Here each integer label is transformed into a ten-element binary vector with a one at the index of the class value and zeros for all other classes.
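To make the encoding concrete, here is a NumPy sketch of the same idea. It is not the Keras implementation; the helper name one_hot and the three labels are made up for illustration.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a 1 at each label's index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


encoded = one_hot([5, 0, 4], num_classes=10)  # made-up digit labels
print(encoded[0])  # the 1 sits at index 5
```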
After doing this, we have to normalize our dataset, since we know that the pixel values of the images vary between 0 and 255 (0 is black and 255 is white). To do this, we scale the data into the range [0,1].
trainX = trainX.astype('float32')
testX = testX.astype('float32')
trainX = trainX / 255.0
testX = testX / 255.0
In the above code, we have first converted the integer pixel values to floats. After that, we have divided those values by the maximum value (i.e. 255) so that all the values are scaled into the range [0,1].
Now we will start building our neural network.
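model = Sequential()
# feature extractor: one convolution layer (32 filters, 3x3 kernel) and one 2x2 max pooling layer
model.add(Conv2D(32, (3, 3), activation='relu',
                 kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
# classifier: one hidden dense layer and a 10-node softmax output layer
model.add(Dense(100, activation='relu',
                kernel_initializer='he_uniform'))
model.add(Dense(10, activation='softmax'))
# recent Keras versions name this argument learning_rate (older code used lr)
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])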
In the above code, we have used the Keras Sequential() API, which is used to create a model layer by layer. After that, we have added a single convolution layer with 32 filters and a kernel size of (3×3). It is followed by a single MaxPooling2D() layer with a pool size of (2×2). Then the output feature map is flattened.
Since there are 10 classes, the output layer needs 10 nodes, one for the prediction of each class (multi-class classification), along with the softmax activation function. Between the feature extractor layers and the output layer, we have added a dense layer with 100 nodes for feature analysis and interpretation by the model.
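The sizes flowing through this model can be worked out by hand. The arithmetic below is a sketch that assumes the Keras defaults for these layers (no padding, convolution stride of 1, pooling stride equal to the pool size).

```python
# Layer-by-layer arithmetic for the model described above.
conv_out = 28 - 3 + 1                # 26 x 26 x 32 after the convolution
pool_out = conv_out // 2             # 13 x 13 x 32 after max pooling
flat_len = pool_out * pool_out * 32  # length of the flattened vector

conv_params = (3 * 3 * 1) * 32 + 32  # kernel weights + one bias per filter
dense_params = flat_len * 100 + 100  # weights + one bias per node
output_params = 100 * 10 + 10
total_params = conv_params + dense_params + output_params

print(conv_out, pool_out, flat_len)
print(total_params)
```

Calling model.summary() on the compiled model prints the output shapes and parameter counts, which is a quick way to compare against this hand calculation.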
The model uses the stochastic gradient descent optimizer (with a learning rate of 0.01 and a momentum of 0.9) and the categorical_crossentropy loss function, which is suitable for multi-class classification models.
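To give a feel for this loss function, here is a NumPy sketch of categorical cross-entropy for a single sample. The label and the two predictions are made up, and the small eps clip is an assumption added here to avoid taking log(0).

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)))


y_true = np.array([0.0, 0.0, 1.0])        # made-up one-hot label: class 2
confident = np.array([0.05, 0.05, 0.90])  # made-up prediction, mostly right
unsure = np.array([0.40, 0.30, 0.30])     # made-up prediction, mostly wrong

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss is small when the model puts high probability on the correct class and grows as that probability shrinks, which is what training tries to minimize.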
Finally, after compiling our model, we need to train it on the training dataset, test it on the testing dataset, and then evaluate its results (i.e. accuracy and loss).
batch_size = 128
num_epoch = 10
# model training
model_log = model.fit(trainX, trainY,
                      batch_size=batch_size,
                      epochs=num_epoch,
                      verbose=1,
                      validation_data=(testX, testY))
In the above code, we have trained for 10 epochs with a batch_size of 128 (the batch size is the number of samples processed in one training iteration). Because verbose=1, Keras prints the loss and accuracy for each epoch during training.
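As a rough check of what these settings mean, the number of weight updates can be computed directly. This sketch assumes all 60,000 MNIST training images are used and that the last, smaller batch of each epoch still counts as one iteration.

```python
import math

train_samples = 60000  # MNIST training images
batch_size = 128
num_epoch = 10

steps_per_epoch = math.ceil(train_samples / batch_size)  # last batch is smaller
total_updates = steps_per_epoch * num_epoch

print(steps_per_epoch, total_updates)
```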

The trained model can then be evaluated on the test set:
score = model.evaluate(testX, testY, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

With a test accuracy above 98%, we can say that our model is trained well enough for accurate prediction. You can also visualize these results using the matplotlib library!
Conclusion
I hope this article helps you understand and grasp the concepts of convolutional neural networks.
For a better understanding of these concepts, I recommend that you try writing this code on your own. Keep exploring, and I am sure you will discover new features along the way.
If you have any questions or comments, please post them in the comment section.
Originally published at: www.patataeater.blogspot.com