
Image Classification Explained to My Grandma

What are CNNs in plain English

In 2019 I took part in an international contest organized by CodeProject and won it with my project KerasUI, a tool providing a web GUI for training and consuming neural networks. It was a special opportunity to refresh my knowledge of neural networks and artificial intelligence after a few years of inactivity in this field. Now that two years have passed, I still talk occasionally with friends and colleagues about it, and I feel the community still doesn’t have the right awareness of this topic. Apart from experts, most people simply don’t know what Image Classification is, or identify it with a few methods of the TensorFlow library. Well, this pushed me to write about this topic again and focus only on the theoretical part!

At the end of the article, you will find the links to the prize-winning article and to the code. I tried to make things as easy as possible, so easy that they could be explained to my grandma 😃

Let’s jump into the explanation!

Photo by Possessed Photography on Unsplash

Learning What is Image Classification

The Image Classification problem is the task of assigning one label from a fixed set of categories to an image. This is one of the core problems in Computer Vision that, despite its simplicity, has a large variety of practical applications. In plain words, what you want is that if you give an image of a dog to the computer, it tells you "it’s a dog".

This is a problem that can be solved using Artificial Intelligence and computer vision. Computer vision helps to manipulate and preprocess images to get them into a form that computers can use (from a bitmap to a matrix of relevant values). Once you have the input in a good form, you can apply an algorithm to predict the result.

The most common solution nowadays is the CNN (Convolutional Neural Network). This kind of neural network is very convenient for image processing and is trained on the dataset you provide.

A dataset is just a list of samples, each one labelled. The main point is that you teach the machine how to decide by example. Usually, the dataset is divided into a training set, a test set, and a validation set. This is because you want to train the network, then check how it behaves on separate data until it works as expected. Finally, if you want objective feedback, you must use data the network has never seen: the validation set. This is required because if you evaluate the network only on the data it was trained on, the error will keep dropping, but the network may work only on the samples you provided and fail on anything slightly different. That’s called overfitting, and it is something to avoid because it means the network didn’t abstract the rules but just repeated what you told it. Think about the math expression 2*5+10: it’s like memorizing that the result is 20 instead of being able to evaluate it.
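As a rough sketch of how such a split might look in practice (a minimal example with made-up NumPy data and scikit-learn’s train_test_split; the array names and percentages are just illustrative assumptions, not the KerasUI setup):

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,000 labelled 256x256 colour images.
images = np.random.rand(1000, 256, 256, 3)
labels = np.random.randint(0, 2, size=1000)   # 0 = "cat", 1 = "dog"

# First carve out the validation set, then split the rest into training and test.
x_rest, x_val, y_rest, y_val = train_test_split(images, labels, test_size=0.2, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x_rest, y_rest, test_size=0.25, random_state=42)

print(x_train.shape, x_test.shape, x_val.shape)   # (600, ...), (200, ...), (200, ...)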

Understanding Convolutional Neural Networks

In this section, we will learn what a CNN is and how it works. The next diagram is a graphical representation of the network. It will be explained in the following sections, but for now, you can imagine that a CNN is like a preprocessing pipeline that polishes data for the final neural network. I know, this definition may seem very rough, but let’s jump to the next paragraph to see how it works in detail!

A sample architecture for a CNN. Made with ❤️ by Daniele Fontani

The Input

In Image Classification, we start from… images! It is not hard to understand that an image is a bidimensional matrix (width * height) composed of pixels. Each pixel, in RGB, is composed of 3 different values: red, green, and blue. To use a CNN it is convenient to separate the 3 different layers, so the final input matrix representing your image will be image_size x image_size x 3. Of course, if you have a black & white image, you don’t need 3 layers but only one, so you’ll have image_size x image_size x 1. If you also consider that your dataset will be composed of N items, the whole input matrix will be N x image_size x image_size x 3. The size must be the same for all images and should not be full HD, to avoid excessively long processing times. There are no written rules for this, and it is often a compromise: 256×256 may be a good value in some cases, while in others you will need more resolution.
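To make this concrete, here is a small sketch of turning a few pictures into such a matrix (it assumes the Pillow and NumPy libraries and a hypothetical list of file names; nothing here comes from KerasUI itself):

import numpy as np
from PIL import Image

IMAGE_SIZE = 256
paths = ["dog_001.jpg", "dog_002.jpg", "cat_001.jpg"]   # hypothetical file names

samples = []
for path in paths:
    img = Image.open(path).convert("RGB")            # force 3 channels (red, green, blue)
    img = img.resize((IMAGE_SIZE, IMAGE_SIZE))       # every image must have the same size
    samples.append(np.asarray(img) / 255.0)          # scale pixel values to the 0..1 range

dataset = np.stack(samples)                          # shape: (N, 256, 256, 3)
print(dataset.shape)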

The Convolution

Inside images, many details, hints, and shades are not relevant for the network. All that detail can confuse the training, so the main idea is to simplify the image while keeping all the data that carries information. This is intuitive and easy to say in words, but how does it work in practice? The CNN uses a convolutional step, the core of this method, to reduce the size of the image while keeping its most relevant parts. The convolution layer has this name because it computes the convolution between a sliding piece of the matrix and a filter. The size of the filter and of the piece of the matrix to analyze are the same; this piece is called the kernel. To make the matrix size suitable for the kernel size, it can be padded with zeroes in all dimensions. The convolution produces one scalar value for each kernel position. This causes a size drop: for example, sliding a 4×4 kernel with a stride of 4 over a 32×32 matrix (1,024 elements) produces an 8×8 output (64 elements). The size of the kernel impacts the final result, and it is often better to keep the kernel small and chain multiple convolutional layers, adding some pooling layers in the middle (I’ll speak about pooling later).

The convolution process. Made with ❤️ by Daniele Fontani
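As a plain-NumPy sketch of what a single convolution does (just to show the sliding multiply-and-sum; a real CNN library does this far more efficiently and learns the kernel values during training):

import numpy as np

def convolve2d(matrix, kernel, stride=1):
    # Slide the kernel over the matrix and sum the element-wise products.
    k = kernel.shape[0]
    out_size = (matrix.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = matrix[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)       # one scalar per kernel position
    return out

image = np.random.rand(32, 32)                       # a single-channel 32x32 "image"
kernel = np.random.rand(4, 4)                        # a 4x4 filter
print(convolve2d(image, kernel, stride=4).shape)     # (8, 8)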

The Pooling

The pooling layer is used to reduce the matrix size. There are many approaches, but the basic idea is: take a set of adjacent values and keep only one. The most common algorithm is max pooling, where you simply take the largest element in the set.

The max pool layer. Made with ❤️ by the author
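A minimal sketch of 2×2 max pooling with NumPy (assuming the input size divides evenly by the pool size):

import numpy as np

def max_pool(matrix, pool=2):
    # Keep only the largest value of every pool x pool block.
    h, w = matrix.shape
    blocks = matrix.reshape(h // pool, pool, w // pool, pool)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 0],
                        [7, 2, 9, 8],
                        [1, 0, 3, 4]])
print(max_pool(feature_map))
# [[6 5]
#  [7 9]]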

The Fully Connected Layer

The final step is the neural network itself. Until this last step, we have only done "deterministic" operations, just algebraic computations. In this step, we have real artificial intelligence: everything done before had the sole purpose of generating data that the network can understand. The output of the convolution and pooling layers is flattened into a long vector and fed into one or more fully connected (dense) layers, which produce a score for each category.
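Putting the pieces together, a minimal Keras sketch of an architecture like the one in the diagram above might look like this (the layer sizes and the two-class output are illustrative assumptions, not the exact KerasUI configuration):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),               # N x image_size x image_size x 3
    layers.Conv2D(32, (3, 3), activation="relu"),    # convolution: extract local features
    layers.MaxPooling2D((2, 2)),                     # pooling: shrink the feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                # turn the maps into one long vector
    layers.Dense(64, activation="relu"),             # the fully connected layer
    layers.Dense(2, activation="softmax"),           # one probability per class (e.g. cat/dog)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()

Training would then be a single call such as model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10), reusing the split from the earlier sketch.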


Conclusion

Since I took my first steps in this field, in 2008, there have been relevant changes. My first feeling, as an amateur, is that now the whole process is much more "deterministic". By using standard technologies and good documentation, it is easier to make a network work. Experience is still important, and I don’t want to compare AI with a regular database read/write operation, but finding a lot of material, tutorials, and guides is something that allows a newbie to get something working.

The credit for this goes to the big players: as usual, they have made AI accessible to developers by sharing their libraries, maybe just to let us know it’s easier to consume all of it from them as a service. 😃

The big advice to developers, especially the youngest ones, is that even though there are a lot of facilities for implementing neural networks, you should focus on the theory first. Understanding how things work under the hood is important for when the system you build on top of the library stops working and you can’t understand why.


References

