A Newbie’s Introduction to Convolutional Neural Networks

Chi-Feng Wang
Towards Data Science
8 min read · Jul 16, 2018


This is a dog.

Image 1: Dog // Source

And this is a cat.

Image 2: Cat // Source

When our eyes see these two pictures, our mind immediately tells us what kind of animal is being shown. Easy, right? But what if you had to teach a machine to distinguish cats and dogs?

If we rely solely on the logic-based (“traditional”) programming that we learn in school, it’s almost impossible. We can attempt to categorize characteristics of cats and dogs, saying that cats have pointy ears and sharp claws, and dogs generally have flappier ears and duller claws. Therefore, if the animal has pointy ears and sharp claws, then it’s a cat. If not, then it’s a dog.

There are a lot of problems with this approach, however. Firstly, there are a lot of exceptions. There are dogs with pointy ears and cats with flappy ears. Also, there are a lot of pictures where cats don’t show their claws. Unless we manually program all the exceptions–and there will be a lot–it’s hard to divide the two animals simply by ticking their characteristics off a list. Secondly, how will the machine recognize whether or not the picture includes a pointy ear? Images don’t come with labels saying {pointy ears: True, sharp claws: True}, meaning we have to find a way of determining it ourselves. When we look at a picture, we automatically see the pointy ears and sharp claws, but how will a machine learn that?

When we see an image of a dog, certain neurons in our brain are stimulated, sending signals to other neurons, which send signals to even more neurons, ultimately resulting in certain neurons firing that “tell” us that we see a dog. Neural networks attempt to simulate that process, building a “mini-brain” that can complete simple tasks such as distinguishing cats from dogs.

Image 3: Basic neural network // Source

The most basic neural network looks something like this. We start out with an input layer of neurons, which activate neurons in the hidden layers, which then activate neurons in the output layer. Think of each circle in the diagram above as a neuron. Each neuron contains a number, known as its activation.

In the dogs vs. cats example, we are given a photo and we have to distinguish whether it is a dog or a cat. Therefore, our input is the photo, and our output is the probability of it being a cat and the probability of it being a dog.

For the sake of simplicity, let us only give the machine black and white photos. Assume that we pass in a photo of dimension 64 x 64 pixels: we have 64 x 64 = 4096 pixels. Each of these pixels holds a number corresponding to the greyscale value of the pixel. Each pixel in the image below has a number from 0 to 255, with 0 being black and 255 being white.

Image 4: Input image // Source

The input layer, thus, will be composed of 4096 neurons with activations lined up together, or an array (list) of 4096 numbers.
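To make this concrete, here is a minimal sketch in Python (using NumPy, with random numbers standing in for a real photo) of a 64 x 64 greyscale image being flattened into that array of 4096 activations:

```python
import numpy as np

# A made-up 64 x 64 greyscale photo: each pixel is a number from 0 (black) to 255 (white).
image = np.random.randint(0, 256, size=(64, 64))

# The input layer is simply those pixels lined up as a single list of 4096 activations.
input_layer = image.flatten()
print(input_layer.shape)  # (4096,)
```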

What happens when we pass in a photo that is not of size 64 x 64 pixels? Neural networks have a set input layer size, which means that this network can only analyze photos of 64 x 64 pixels. If the photo we pass in is larger than that, we can program the machine to only analyze the middle 64 x 64 pixels, or shrink the entire photo down until it reaches that size. If the photo we pass in is smaller than 64 x 64, we can enlarge the photo, or simply not analyze it. Of course, the number 64 is something that I chose; you can change the input size with every run-through. Computer scientists have found experimentally that analyzing square portions (e.g. 64 x 64 instead of, say, 64 x 70) produces better results, so typically we stick with analyzing square portions.
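As a rough illustration, here is one way that resizing step might look, sketched with the Pillow imaging library; the photo_path argument and the crop-versus-enlarge choices below are just assumptions for the example, not the only way to do it:

```python
from PIL import Image

def prepare(photo_path, size=64):
    """Force any photo into a size x size greyscale image, as described above."""
    photo = Image.open(photo_path).convert("L")    # "L" means greyscale
    w, h = photo.size
    if w >= size and h >= size:
        # Photo is larger: only keep the middle size x size pixels.
        left, top = (w - size) // 2, (h - size) // 2
        photo = photo.crop((left, top, left + size, top + size))
    else:
        # Photo is smaller: enlarge it until it reaches size x size.
        photo = photo.resize((size, size))
    return photo
```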

On the other end, the output layer should only contain two neurons: one neuron for the probability of it being a cat, and one for being a dog. Ideally, if we pass in a photo of a cat, we should get a result like this:

Image 5: Cat output

or, numerically, an array like this: [1.00, 0.00].

If we pass in a photo of a dog, we should get the opposite result. However, more likely, a well-trained neural network might produce a result like this:

Image 6: Dog output

or, numerically, an array like this: [0.98, 0.02].

The more well trained a neural network is, the closer it gets to the correct answer.

In between the input and output layers, there are several hidden layers. In the image above, only one hidden layer of 5 neurons is shown, but most neural networks have multiple hidden layers with many neurons. Ideally, we can imagine each layer as having a particular purpose; for example, the second layer (first hidden layer) will recognize the outline of the animal, the third layer will recognize certain shapes (such as circles), the fourth layer will recognize animal parts (for example, a circle within a circle may be an eye and a pupil), and the last (output) layer will recognize whether it is a cat or dog based on the characteristics of the animal parts.

How does that work? This website demonstrates very well how the characteristics of an image can be analyzed. Here, the input image of a man is transformed into an output image of the outline of the man.

Image 7: Convolution layer

This is done simply by grabbing 9 pixels (3 x 3, as seen in the top left corner of the input image), multiplying each of the 9 pixels by a certain number, and adding the results together. In this example, the 9 pixels are multiplied by these numbers:

Image 8: Convolution kernel

The top left pixel’s greyscale number is multiplied by -1, the top center pixel’s greyscale number is multiplied by -1…etc. Then, all the numbers are added up together, and this new greyscale number is the value of the corresponding pixel in the output image. Here, the output is -172, so the pixel in the red box in the output image is black.

Or, simply put, the machine takes 9 pixels in a 3 x 3 matrix, multiplies it element-wise with another 3 x 3 matrix (the kernel), and adds up the results to produce the new greyscale number of the new image.
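In code, a single convolution step might look like the sketch below (the kernel is the outline matrix described above; the patch values are made up, so the result will not match the -172 from the example):

```python
import numpy as np

# Nine made-up input pixels (a 3 x 3 patch of the image).
patch = np.array([[ 60, 113,  56],
                  [139,  85, 124],
                  [ 30,  48,  40]])

# The outline kernel: -1 everywhere, 8 in the center.
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Multiply each pixel by the matching kernel number, then add everything up.
new_pixel = np.sum(patch * kernel)
print(new_pixel)  # one greyscale number for the output image
```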

Then, the machine moves down to the next set of 3 x 3 pixels, like this:

Image 9: Convolution layer

It continues until a full output image is created, containing only the outlines of the original image.
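Sliding the kernel across every position can be sketched as a pair of loops. Real libraries do this far more efficiently, but the toy version below mirrors the description above (the random 64 x 64 “photo” is only a stand-in):

```python
import numpy as np

def convolve(image, kernel):
    """Slide a square kernel across a greyscale image, one patch at a time."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]
            out[i, j] = np.sum(patch * kernel)   # multiply element-wise, then add up
    return out

outline_kernel = np.array([[-1, -1, -1],
                           [-1,  8, -1],
                           [-1, -1, -1]])
photo = np.random.randint(0, 256, size=(64, 64))
outline = convolve(photo, outline_kernel)
print(outline.shape)  # (62, 62): slightly smaller, since the kernel cannot hang off the edges
```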

The same can be done with the first two layers of our dogs vs cats neural network. The photo of the dog–or cat–is passed in, then goes through a transformation matrix that outlines the animal in the picture to create a new outline picture. Each pixel in the new outline picture is a neuron in the second layer.

More than one transformation can be done between two layers. For example, we may choose to highlight the vertical edges of a dog picture AND the horizontal edges. These are two transformations that require two transformation matrices and will produce two output images. In this case, the two images (and their pixels) can be placed together to form the second layer, as in the sketch below. In this way, the second layer does not have to contain the same number of neurons as the input layer; in fact, none of the layers in a neural network have to have the same number of neurons.
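For instance, here is a sketch of those two transformations using SciPy’s convolve2d; the particular vertical and horizontal kernels are the common Sobel filters, chosen purely for illustration rather than taken from the article:

```python
import numpy as np
from scipy.signal import convolve2d   # library routine that slides a kernel across an image

# Two different kernels: one highlights vertical edges, the other horizontal edges.
vertical_kernel = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]])
horizontal_kernel = vertical_kernel.T

photo = np.random.randint(0, 256, size=(64, 64))

vertical_edges = convolve2d(photo, vertical_kernel, mode="valid")
horizontal_edges = convolve2d(photo, horizontal_kernel, mode="valid")

# The two output images are stacked together to form the second layer.
second_layer = np.stack([vertical_edges, horizontal_edges])
print(second_layer.shape)  # (2, 62, 62): a different number of neurons than the 4096-pixel input
```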

From the second layer, more transformations are done to the image to produce the rest of the hidden layers. The last hidden layer then undergoes a final transformation which produces two numbers: the probability of it being a cat and the probability of it being a dog.
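One common way to picture that final transformation (a typical choice, not something spelled out above) is a weighted sum per output neuron followed by a “softmax” squashing that turns the two raw numbers into probabilities; every size and value below is made up:

```python
import numpy as np

last_hidden = np.random.rand(128)       # pretend activations from the last hidden layer
weights = np.random.randn(2, 128)       # one row of weights per output neuron (cat, dog)
biases = np.random.randn(2)

scores = weights @ last_hidden + biases                     # two raw numbers
probabilities = np.exp(scores) / np.sum(np.exp(scores))     # softmax: positive, sums to 1
print(probabilities)   # e.g. something like [0.98, 0.02]
```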

This video demonstrates the different layers of a neural network that recognizes letters, with the initial image undergoing several transformations until it is finally mapped to a certain letter.

Image 10: Convolution layer

Going back to this example, the 3 x 3 pixels in the input image are multiplied by the outline transformation matrix to produce -172, the activation for the pixel in the output image. We call the numbers in the transformation matrix (the -1s and the 8) the weights. To better visualize it, here is our neural network diagram once again:

Image 11: Weights

Because there is limited space, I’ve only represented the middle column of 3 pixels in the input image as the three neurons in the input layer. The top neuron in the hidden layer is the pixel produced in the output image. Each of the numbers the neurons are multiplied by (the -1s and the 8) is a weight of that neuron.

Sometimes, we may want to perform an additional operation beyond this multiply-and-add. For example, we may wish to shift activations down by a specific value: we may take the 9 pixels and the transformation matrix, compute their scalar product to get -172, then subtract 10 to get -162. In this case, the 10 would be its bias, meaning the weighted sum has to reach at least 10 before the neuron becomes meaningfully active.
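Putting the weights and the bias together, one neuron’s computation might be sketched like this (the pixel values are random, and the subtract-10 convention simply mirrors the -172 to -162 example above; many textbooks instead write the bias as a number that gets added, which here would just be -10):

```python
import numpy as np

kernel = np.array([[-1, -1, -1],       # the weights
                   [-1,  8, -1],
                   [-1, -1, -1]])
bias = 10

def neuron_activation(patch):
    """Weighted sum of the 9 pixels, shifted down by the bias."""
    return np.sum(patch * kernel) - bias

patch = np.random.randint(0, 256, size=(3, 3))   # made-up pixel values
print(neuron_activation(patch))
```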

Every neuron has its bias and set of weights. With the correct weights and biases, the neural network will be able to distinguish cats from dogs.
