Build Your Own Convolutional Neural Network in 5 mins

An introduction to CNN and code (Keras)

Rohith Gandhi
Towards Data Science


What is a Convolutional Neural Network?

CNN

Before answering what a convolutional neural network is, I believe you guys are aware of what neural networks are. If you are shaky on the basics, check out this link. Moving on. A convolutional neural network is similar to a multi-layer perceptron network. The major differences are what the network learns, how it is structured and what it is mostly used for. Convolutional neural networks were also inspired by biological processes: their structure resembles the visual cortex of an animal. CNNs are largely applied in the domain of computer vision and have achieved state-of-the-art performance on a variety of benchmarks.

What do the hidden layers learn?

The hidden layers in a CNN are generally convolution and pooling (downsampling) layers. In each convolution layer, we take a small filter, slide it across the image and perform convolution operations. A convolution operation is nothing but an element-wise multiplication between the filter values and the pixels in the image, with the resulting values summed up.

Convolution operations
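
To make this concrete, here is a minimal NumPy sketch of a single convolution step; the patch and filter values are purely illustrative.

import numpy as np

# A 3x3 patch of an image and a 3x3 filter (illustrative values)
patch  = np.array([[1, 2, 0],
                   [0, 1, 3],
                   [4, 0, 1]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# One convolution step: element-wise multiplication, then a sum
value = np.sum(patch * kernel)
print(value)  # (1 + 0 + 4) - (0 + 3 + 1) = 1

Sliding this computation across every position of the image produces the layer's output, commonly called a feature map.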

The filter’s values are tuned through the iterative process of training, and after the network has trained for a certain number of epochs, these filters start to look for various features in the image. Take the example of face detection using a convolutional neural network. The earlier layers of the network look for simple features such as edges at different orientations. As we progress through the network, the layers start detecting more complex features, and the features detected by the final layers almost look like a face.

Different features recognised at different layers

Now, let’s move on to pooling layers. Pooling layers are used to downsample the image. An image contains a large number of pixel values, and it is typically easier for the network to learn the features if the image size is progressively reduced. Pooling layers reduce the number of parameters required and hence the computation needed; pooling also helps avoid overfitting. There are two common types of pooling operations:

  • Max Pooling — selecting the maximum value in each window
  • Average Pooling — summing all of the values in each window and dividing by the number of values

Average pooling is rarely used; you will find max pooling in most examples.

Max Pooling
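
As a quick illustration, here is a minimal NumPy sketch of 2×2 max pooling with a stride of 2 on a made-up 4×4 input.

import numpy as np

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 8, 3],
              [1, 0, 4, 9]])

# 2x2 max pooling with stride 2: keep the largest value in each window
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 5]
#  [7 9]]

Each 2×2 window of the input collapses to a single value, halving the width and height while keeping the strongest activation.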

Code

Before we start coding, I would like to let you know that the dataset we are going to use is the MNIST digits dataset, and we are going to use the Keras library with a TensorFlow backend to build the model. Ok, enough. Let’s do some coding.

First, let us do the necessary imports. The Keras library helps us build our convolutional neural network. We download the MNIST dataset through Keras. We import a Sequential model, which is a pre-built Keras model to which you can simply add layers. We import the convolution and pooling layers. We also import Dense layers, as they are used to predict the labels. The Dropout layer reduces overfitting, and the Flatten layer collapses a three-dimensional output into a one-dimensional vector. Finally, we import NumPy for matrix operations.
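
Based on the description above, the imports might look something like this (written against the standalone Keras API with a TensorFlow backend):

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
import numpy as np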

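Next comes the data preparation explained in the paragraph below; a minimal sketch might look like this (the pixel scaling to the [0, 1] range is an extra step not mentioned in the text):

# Load the MNIST digits dataset (60,000 training and 10,000 test images)
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Reshape to (samples, height, width, channels); 1 channel for grayscale
x_train = x_train.reshape(60000, 28, 28, 1)
x_test = x_test.reshape(10000, 28, 28, 1)

# Scale pixel values to the [0, 1] range (assumed, not stated in the text)
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

# Convert the target values (0-9) into binary class matrices
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
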
Most of the statements in the above code are straightforward, so I will explain only a few lines. We reshape x_train and x_test because our CNN accepts only a four-dimensional input. The value 60000 represents the number of images in the training data, 28 represents the image size (28×28 pixels) and 1 represents the number of channels. The number of channels is set to 1 if the image is in grayscale, and to 3 if the image is in RGB format. We also convert our target values into binary class matrices. To see what a binary class matrix looks like, take a look at the example below.

Y = 2                        # the value 2 represents that the image contains the digit 2
Y = [0,0,1,0,0,0,0,0,0,0]    # index 2 of the vector is set to 1
# Here, the class value has been converted into a binary class matrix

We build a Sequential model and add convolutional layers and max pooling layers to it. We also add dropout layers in between; dropout randomly switches off some neurons during training, which forces the network to find new paths through the remaining neurons and therefore reduces overfitting. We add dense layers at the end, which are used to predict the class labels (0–9).
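
A sketch of such a model is shown below; the filter counts, kernel sizes and dropout rates follow the standard Keras MNIST example and are assumptions rather than prescribed values.

model = Sequential()
# Two convolutional layers that learn 3x3 filters
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
# Downsample with 2x2 max pooling, then drop some activations
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
# Flatten to a vector and predict one of the 10 digit classes
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))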

We now compile the model with a categorical cross-entropy loss function, the Adadelta optimizer and an accuracy metric. We then fit the dataset to the model, i.e. we train the model for 12 epochs. After training, we evaluate the loss and accuracy of the model on the test data and print them.
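
A sketch of the compile, train and evaluate steps; the batch size of 128 is an assumption, as the text does not specify it.

# Compile with categorical cross-entropy, Adadelta and an accuracy metric
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# Train for 12 epochs (batch size of 128 is assumed)
model.fit(x_train, y_train,
          batch_size=128,
          epochs=12,
          validation_data=(x_test, y_test))

# Evaluate loss and accuracy on the test data and print them
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])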

Output

Conclusion

Convolutional neural networks do have some shortcomings, as pointed out by Geoffrey Hinton. He posited that his capsule networks are the way to go if we are looking to achieve human-level accuracy in the domain of computer vision. But, as of now, CNNs seem to be doing really well. Please let me know if you found this article useful, thank you :)
