Using Deep Learning to identify dog breeds
Continuing our discussion on computer vision (see this article for a detailed introduction to the field), we will build a Deep Learning model that classifies a dog into one of 120 breeds from its image. For this, we will be using Google's TensorFlow platform in Python.
Convolutional Neural Networks (CNNs): A blueprint
Before we build the actual model, it is worth discussing the building blocks that underlie Convolutional Neural Networks. Each model is made up of several "layers" stacked together, each of which has a specific function. I will briefly discuss the intuition behind the most important layers below, but there are many resources that provide more detailed documentation on how these layers work and how they can be fine-tuned for a specific task (I would start with the official documentation from Keras).
- Convolutional Layers: Convolutional layers are used to extract features from images, as was discussed in Part 1 of this series. We set up convolutional layers by specifying how many features we want to extract from the image, and the size of the convolutional matrix we want to use. We don't need to tell the model which features to extract: for instance, we don't need to tell it to detect edges and outlines; the model "learns" this as it is given data to train on.
- Pooling Layers: Pooling layers are used to reduce the dimensionality of the data by "summarizing" the information contained in a specific segment of the image. An example of this would be reducing a 4x4 grid to a 2x2 grid by representing each segment by its maximum value (we can also choose the average value instead of the maximum). Pooling has a couple of purposes: it reduces dimensionality, and it makes the model less sensitive to the exact location of a feature, which is desirable because we want the model to recognize a feature even if it is slightly to the left or right of its reference location.

- Dense Layers: Dense layers are made up of a fixed number of "neurons" or cells that take 1-dimensional inputs from (in the case of CNNs) the convolutional layers and process them for further use; the output of these layers is either fed forward to other dense layers or used to predict the final output. For instance, suppose we have a convolutional layer that extracts 64 features from an image, and we want to use these features to reach our prediction. We can pass this information to a dense layer with, for instance, 16 nodes. Each node in the dense layer is fully connected to the convolutional layer, i.e. it collects information from all 64 features. Each node then applies a different set of weights to the inputs from each feature and arrives at a "score" that is then fed to other dense layers or used to predict the outcome.
- Flatten and Dropout Layers: Convolutional layers return a 2D output (since an image is processed as a 2D grid), but our dense layers take in only 1D inputs. To allow these layers to communicate, we need to "flatten" this information from 2D to 1D. We can do this by using "global pooling", i.e. by representing each extracted feature map by a single summary figure, or by using a "Flatten" layer. Dropout layers are used to prevent the model from overfitting to the data. A layer with a dropout rate of 30%, for instance, tells the model to randomly ignore 30% of the nodes of the preceding layer each time. This means that the model has to "generalise" well in order to give accurate outputs when 30% of its nodes are going to be ignored at random.
Quick note on overfitting vs underfitting
The best way to understand overfitting and underfitting is with an analogy. Let's imagine that we have three students who are studying for an exam. A's preparation consists of memorizing the course material, B has taken his time to understand the concepts, and C does not bother preparing at all. An "overfit" model is like A: it "memorises" the features of the dataset it was trained on. The reason we don't want a model to be overfit is that when we actually use it on data it has not seen, it will perform poorly (just like A would in an exam that requires the application of the concepts he is supposed to have studied). An "underfit" model is like C: it hasn't learnt the features of the training data, and we obviously don't want that. The ideal model is like B: it learns from the data we train it on, but is able to generalise the features of this dataset and still make accurate predictions on new data.
Typically, we would use several convolutional and dense layers in the same model. A simple structure may look like this:
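Below is a minimal sketch of such a structure in Keras. The layer sizes and the 224x224 input shape are illustrative choices rather than values tuned for this problem.

```python
# A minimal, illustrative CNN in Keras (layer sizes are arbitrary examples)
from tensorflow.keras import layers, models

model = models.Sequential([
    # Extract 32 features using a 3x3 convolutional matrix
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    # Summarise each 2x2 segment by its maximum value
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Flatten the 2D feature maps into a 1D vector for the dense layers
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    # Randomly ignore 30% of the preceding layer's nodes during training
    layers.Dropout(0.3),
    # One output node per class (here, 120 dog breeds)
    layers.Dense(120, activation="softmax"),
])
model.summary()
```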
The Data
We will be working with a dataset of 10,000+ images of dogs belonging to 120 breeds. The dataset is available here. We will give the bulk of this dataset to the model to train on, and will then see how accurately it can predict the breeds of the remaining images, which it has not seen before. The probability of randomly guessing the correct breed is about 1/120, or less than 1%; let's see how well our model performs.
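As a rough sketch, the train/test split can be set up with Keras' ImageDataGenerator, assuming the images have been arranged into one subfolder per breed (the folder path below is hypothetical):

```python
# Sketch: load the images and hold out ~20% of them as a test set.
# Assumes a layout of data/dogs/<breed>/<image>.jpg (hypothetical path).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

loader = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_gen = loader.flow_from_directory(
    "data/dogs", target_size=(224, 224), batch_size=32,
    class_mode="categorical", subset="training", seed=42)

test_gen = loader.flow_from_directory(
    "data/dogs", target_size=(224, 224), batch_size=32,
    class_mode="categorical", subset="validation", seed=42)

breeds = list(train_gen.class_indices)  # the 120 breed names, read from the folder names
```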
Data augmentation
For most computer vision problems, it is often a good idea to "augment" the dataset. We take the images we have, and then randomly transform them within some set parameters. For instance, we may rotate the images by up to 30 degrees, increase or decrease the brightness of the image by up to 20%, zoom in or out, etc. This does two things: it increases the amount of data we have to work with, and it helps ensure that the model can still recognize (in this case) the breed of the dog, even if the image is slightly shifted. This has obvious advantages, since every new image we encounter is not going to have exactly the same zoom, brightness, etc.
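Here is a sketch of what this looks like with Keras' ImageDataGenerator; treat the exact ranges as examples rather than the precise values used for the final model.

```python
# Sketch: random transformations applied to each training image on the fly
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,            # scale pixel values to [0, 1]
    rotation_range=30,            # rotate by up to 30 degrees
    brightness_range=(0.8, 1.2),  # brighten or darken by up to 20%
    zoom_range=0.2,               # zoom in or out by up to 20%
    horizontal_flip=True,         # randomly mirror the image
)
# In practice, train_gen above would be built from this augmenter rather than
# the plain loader, so the model sees a slightly different image each epoch.
```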
To illustrate this, let’s meet Hazel. She is a Cocker Spaniel, and is not part of the dataset used to train or test the model.

Now let’s "augment" this image to create 15 images, each only slightly different from the other. This is what the output looks like:

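For the curious, the 15 variations above can be generated roughly like this, reusing the `augmenter` from the previous snippet (the file name is hypothetical):

```python
# Sketch: produce 15 randomly transformed copies of a single image
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

hazel = img_to_array(load_img("hazel.jpg", target_size=(224, 224)))
hazel = np.expand_dims(hazel, axis=0)   # the generator expects a batch dimension

variations = []
for batch in augmenter.flow(hazel, batch_size=1):
    variations.append(batch[0])
    if len(variations) == 15:           # stop after 15 augmented copies
        break
```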
The model
Now, given the complexity of the problem (we have to differentiate between 120 breeds, many of which look alike), a simple model like the one described earlier is not likely to be up to the task. We probably want a much deeper model, with more layers. The process of choosing the right model structure inherently involves a lot of trial and error. Luckily, there are some "pre-trained" models available to use that have already been trained (on a different dataset, of course) to categorise images into up to 1,000 classes. We can take such a model, re-train all or only a part of its layers on our dataset, and see how well this performs. This is called "transfer learning".
Here, I have used the DenseNet-121 model (it has 121 layers and over 8 million parameters!). I have replaced the last dense layer, which has 1,000 nodes, with one that has 120 nodes (since the original model was trained on a dataset with 1,000 classes, whereas we only have 120 breeds) and retrained only the last 11 layers of the model.
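In code, the setup looks roughly like this; the pooling choice, optimiser, and number of epochs are assumptions on my part rather than the exact configuration used.

```python
# Sketch: DenseNet121 pre-trained on ImageNet, with a new 120-node output layer
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

base = DenseNet121(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3), pooling="avg")

# Freeze everything except the last 11 layers
for layer in base.layers[:-11]:
    layer.trainable = False

model = models.Sequential([
    base,
    layers.Dense(120, activation="softmax"),  # one node per breed
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_gen, epochs=10)  # train on the (augmented) training images
```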
The model architecture is detailed in their paper, and an excerpt is presented below:

Details on all the pretrained models available as part of the Keras library can be found here.
Looking "under the hood"
Neural networks are often treated as a "black box" due to their complexity. Therefore, before presenting the code and the model's performance, it is a good idea to try and visualise what the model is "seeing" at different steps along the way to the final prediction.
The final model has 120 convolutional layers (each of which extracts several features from the image) and one dense layer. While it is not practical to see the results of all the convolutions that each image undergoes, I have used Hazel's image as an example and presented below three "convolved" images that different layers of the final (trained) model generate. This will allow us to see what the model is seeing (even though it is only a small portion of the overall information the model uses to make its prediction).

As we can see, the features that are extracted in the first few layers might still be discernible to us, but by the time the model gets to the 27th (of 120) convolutional layer, the resultant image is basically unrecognizable to the human eye. (It is important to appreciate that we did not tell the CNN what features to look for; this is something it learnt on its own.) The last row shows the final set of images (notice that each is a 7x7 grid of pixels, each pixel characterized by a different color) that is then "flattened" and passed on to the dense layer for prediction. That means the model reaches its prediction as to which of the 120 breeds of dog the image shows based on a bunch of 7x7 images that we can no longer even associate with the original image. While this looks like gibberish to us, the CNN is carefully calibrated to recognize intricate patterns in these "dots" that help it make its prediction.
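One way to produce pictures like these is to build a second Keras model that stops at an intermediate layer and outputs its activations. Here is a sketch, reusing the `base` model defined earlier; the layer index and file name are purely illustrative.

```python
# Sketch: view one feature map from the 27th convolutional layer of the network
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras import Model, layers
from tensorflow.keras.preprocessing.image import load_img, img_to_array

conv_layers = [l for l in base.layers if isinstance(l, layers.Conv2D)]
probe = Model(inputs=base.input, outputs=conv_layers[26].output)

img = img_to_array(load_img("hazel.jpg", target_size=(224, 224))) / 255.0
activations = probe.predict(np.expand_dims(img, axis=0))

plt.imshow(activations[0, :, :, 0], cmap="viridis")  # show the first feature map only
plt.axis("off")
plt.show()
```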
Model Performance
The model had an accuracy of ~75% on the test dataset (keep in mind, these are images that the model was not trained on). While performance could be enhanced by tweaking some parameters and retraining more layers of the original DenseNet model, this is already pretty good considering that the probability of randomly guessing the right breed is less than 1%, and that many dog breeds look very similar to one another.
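For completeness, here is roughly how the test-set accuracy can be computed and how the model can be asked about a single photo, reusing `model`, `test_gen`, `breeds`, and the scaled `img` from the earlier snippets.

```python
# Sketch: score the model on the held-out images and predict a single photo
import numpy as np

loss, accuracy = model.evaluate(test_gen)
print(f"Test accuracy: {accuracy:.1%}")

probs = model.predict(np.expand_dims(img, axis=0))[0]
top3 = np.argsort(probs)[::-1][:3]                   # three most likely breeds
for i in top3:
    print(f"{breeds[i]}: {probs[i]:.1%}")
```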
As far as predicting which breed Hazel belongs to, here is what the model had to say:
