Dog Breed prediction using CNNs and transfer learning

Jesse Fredrickson
Towards Data Science
17 min read · Aug 2, 2019


In this article, I will demonstrate how to use keras and tensorflow to build, train, and test a Convolutional Neural Network capable of identifying the breed of a dog in a supplied image. Success will be defined by high validation and test accuracy, with precision and recall scores differentiating between models with similar accuracy.

This is a supervised learning problem, specifically a multiclass classification problem, and as such the solution may be approached with the following steps:

  1. Amass labelled data. In this case, that means compiling a repository of images with dogs of known breeds.
  2. Construct a model capable of extracting data from training images, which outputs data which may be interpreted to discern a breed of dog.
  3. Train the model on training data, validate performance during training with validation data
  4. Evaluate performance metrics, potentially return to step 2 with edits to improve performance
  5. Test the model on test data

Each step of course has a number of sub-steps which I will detail as I go.

Prelude

The act of training a neural network, even a relatively simple one, can be extremely computationally expensive. Many companies use server racks of GPUs dedicated to this kind of task; I will be working on my local PC, which is equipped with a GTX 1070 graphics card that I will recruit for this exercise. To perform this task on your own computer, there are a few steps you must take to set up a proper programming environment, and I will detail them here. If you are not interested in the backend setup, skip to the next section.

First, I created a new environment in Anaconda, and installed the following packages:

  • tensorflow-gpu
  • jupyter
  • glob2
  • scikit-learn
  • keras
  • matplotlib
  • opencv (used for identifying human faces in an image pipeline; not a necessary capability, but useful in some applications)
  • tqdm
  • pillow
  • seaborn

Next, I updated my graphics drivers. This is important because driver updates are pushed out fairly regularly, even for a card like mine which is 3 years old, and if you are using a modern version of tensorflow, it is necessary to use up-to-date drivers for compatibility. In my case, drivers that were only 5 months old were incompatible with the newest version of tensorflow.

Finally, open a jupyter notebook from an anaconda prompt in the new environment in order to perform your work, and ensure jupyter is using the correct environment for the kernel. Tensorflow may encounter issues if you don’t do this.

As a sanity check, after importing my modules I can call the following to show the available CPUs and GPUs.

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

Success! There’s my GTX 1070.

Finally, I execute the following code block which edits some configuration parameters of the tensorflow backend and prevents some runtime errors in the future.

# tensorflow local GPU configuration (TF 1.x API)
import tensorflow as tf

# cap GPU memory usage at 80% of the card and let the allocation grow as needed
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
config = tf.ConfigProto(gpu_options=gpu_options)
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

Step 1: Compile Data

In my case, this was trivial, as Udacity provided me with 1.08 GB of dog images spanning 133 breeds, already in a proper file structure. Proper file structure, in the case of a classification CNN built with keras, means files are segregated into training, validation, and testing folders, and further segregated within those folders by dog breed. The name of each folder should be the name of the class you plan to identify.

Obviously, there are more than 133 dog breeds in the world — the American authority, the AKC, lists 190 breeds, and the world authority, the FCI, lists 360 breeds. If I wanted to increase the size of my training dataset to include more breeds, or just more images of each breed, one avenue I could pursue would be to install the python Flickr API and query it for images tagged with the names of whatever breeds I desired. However, for the purposes of this project, I continue with this basic dataset.

As an initial step, I will load all of the filenames into memory for easier processing down the road.

# define function to load train, test, and validation datasets
import numpy as np
from glob import glob
from sklearn.datasets import load_files
from keras.utils import np_utils

def load_dataset(path):
    data = load_files(path)
    dog_files = np.array(data['filenames'])
    dog_targets = np_utils.to_categorical(np.array(data['target']))  # num_classes (133) is inferred
    return dog_files, dog_targets

# load train, test, and validation datasets
train_files, train_targets = load_dataset('dogImages/train')
valid_files, valid_targets = load_dataset('dogImages/valid')
test_files, test_targets = load_dataset('dogImages/test')

# load list of dog names
# the [20:-1] slice simply removes the filepath and folder number
dog_names = [item[20:-1] for item in sorted(glob("dogImages/train/*/"))]

# print statistics about the dataset
print('There are %d total dog categories.' % len(dog_names))
print('There are %d total dog images.\n' % len(np.hstack([train_files, valid_files, test_files])))
print('There are %d training dog images.' % len(train_files))
print('There are %d validation dog images.' % len(valid_files))
print('There are %d test dog images.' % len(test_files))

which outputs the following stats:

There are 133 total dog categories.
There are 8351 total dog images.

There are 6680 training dog images.
There are 835 validation dog images.
There are 836 test dog images.

Next, I perform a step to normalize the data by dividing every pixel in each image by 255, and format the output as a tensor — a multidimensional array that keras can consume. Note: the following code loads thousands of files into memory as tensors. Although this is possible with a relatively small dataset, it is better practice to use a batch loading system which only loads a small number of tensors at a time. I do this in a later step, for the last model I design.

# define functions for reading in image files as tensors
import numpy as np
from tqdm import tqdm
from keras.preprocessing import image

def path_to_tensor(img_path, target_size=(224, 224)):
    # loads RGB image as PIL.Image.Image type
    # use target_size=(299, 299) for Xception; 224 works for the other models
    img = image.load_img(img_path, target_size=target_size)
    # convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, *target_size, 3) and return it
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths, target_size=(224, 224)):
    list_of_tensors = [path_to_tensor(img_path, target_size) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)

# allow PIL to load slightly truncated image files without raising errors
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

# pre-process the data for Keras: normalize pixel values to the range 0-1
train_tensors = paths_to_tensor(train_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255
test_tensors = paths_to_tensor(test_files).astype('float32')/255

Construct, Train, Test, Evaluate

There are an infinite number of ways to do this, some of which will work much better than others. I am going to explore 3 unique approaches, and follow each from construction through to testing and evaluation. The approaches I take are as follows:

  1. Trivial solution. I will construct and train a very simple CNN on the dataset and evaluate its performance.
  2. Transfer learning with bottleneck features. I will make use of an existing CNN which has been trained on a massive image library, and adapt it to my application by using it to transform my input images into “bottleneck features”: abstract feature representations of the images.
  3. Transfer learning with image augmentation. Similar to the bottleneck features approach, but I will attempt to get better model generalization by creating a model which is a stack of a pretrained bottleneck feature CNN with a custom output layer for my application, and I will feed it input images which are randomly augmented by geometric transformations.

To start with, I will demonstrate the trivial approach of creating a basic CNN and training it on the dataset.

Step 2a: construct trivial model

I create a simple CNN with the following code, using keras with a tensorflow backend.

from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential

# Define model architecture
model = Sequential()
model.add(Conv2D(16, kernel_size=2, activation='relu', input_shape=(224, 224, 3)))  # activation nonlinearity typically performed before pooling
model.add(MaxPooling2D())  # defaults to pool_size=(2,2), stride=None=pool_size
model.add(Conv2D(32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D())
model.add(Conv2D(64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D())
model.add(GlobalAveragePooling2D())
model.add(Dense(133, activation='softmax'))
model.summary()

The model.summary() method prints a table of the model structure: each layer, its output shape, and its parameter count.

Here I have created an 8-layer sequential neural network utilizing 3 convolutional layers paired with max pooling layers, and terminating in a fully connected layer with 133 nodes — one for each class I am trying to predict. Note that in the dense layer I use a softmax activation function; the reason for this is that its outputs range from 0 to 1, and it forces the sum of all nodes in the output layer to be 1. This allows us to interpret the output of a single node as the model's predicted probability that the input was of the class corresponding to that node. In other words, if the second node in the layer has an activation value of 0.8 for a particular image, we can say that the model has predicted that the input has an 80% chance of being from the second class. Note the roughly 19,000 model parameters — these are the weights, biases, and kernels (convolution filters) that my network is going to attempt to optimize. Now it should be evident why this process is so computationally demanding.

Finally, I compile this model so that it can be trained. Note that there are many loss functions and optimizers I could use here, but current common convention for multiclass image label prediction uses Adam for the optimizer, and categorical crossentropy as the loss function. I tested Adam against SGD and RMSProp, and found Adam trained much faster.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Step 3a: Train trivial model

Now I have a list of tensors on which to train, validate, and test a model, and I have a fully compiled CNN. Before I begin training, I define a ModelCheckpoint object, which will serve as a hook I can use to save my model weights as I go for easy loading in the future without retraining. To train the model, I call the .fit() method of the model with my keyword arguments.

from keras.callbacks import ModelCheckpoint

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.from_scratch.hdf5',
                               verbose=1, save_best_only=True)
model.fit(train_tensors, train_targets,
          validation_data=(valid_tensors, valid_targets),
          epochs=3, batch_size=20, callbacks=[checkpointer], verbose=2)

As you can see, I am only running this model for 3 epochs, as I know it will not be a high performer due to its simplicity — this model is purely for demonstration purposes.

The model finished training with a training accuracy of 1.77% and a validation accuracy of 1.68%. Although that's better than random guessing (which would be 1 in 133, or about 0.75%), it's nothing to write home about.

As a side note, when training this model, we can see an immediate jump in my GPU usage! This is great — it means that the tensorflow backend is indeed using my graphics card.

Step 4a: Evaluate trivial model

The model did not achieve a reasonable accuracy on the training or validation data, evidence that it underfit the data considerably. Here I show a confusion matrix for the model’s predictions, as well as a classification report.
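A minimal sketch of how these are generated, assuming the model, valid_tensors, and valid_targets objects from the earlier code blocks:

# evaluation sketch: classification report and confusion matrix heatmap
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

y_true = np.argmax(valid_targets, axis=1)
y_pred = np.argmax(model.predict(valid_tensors), axis=1)

print(classification_report(y_true, y_pred))
sns.heatmap(confusion_matrix(y_true, y_pred))
plt.xlabel('predicted class')
plt.ylabel('true class')
plt.show()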

The model predicted one of two classes for almost every input image. It seems to be partial to basset hounds and border collies. Unfortunately, I have not created sentient AI that has a favorite dog; there are just a few more pictures in the training set of basset hounds and border collies than most other categories, and the model learned this. Due to the drastic underfitting of the model, it is not worth exploring the precision or recall it achieved at this time.

Step 5a: Test trivial model

Finally, I test the model on the test dataset.

# get index of predicted dog breed for each image in test set
dog_breed_predictions = [np.argmax(model.predict(np.expand_dims(tensor, axis=0))) for tensor in test_tensors]
# report test accuracy
test_accuracy = 100*np.sum(np.array(dog_breed_predictions)==np.argmax(test_targets, axis=1))/len(dog_breed_predictions)
print('Test accuracy: %.4f%%' % test_accuracy)

This yields a test accuracy of 1.6746% — in line with what I was expecting. If I were to train the model on more epochs, it is likely that I could achieve a higher accuracy, but this model is highly simplistic, and it would be a better idea to revise my architecture. Soon I will demonstrate better methods of model building using transfer learning which can achieve much higher accuracy.

Step 2b: Construct bottleneck features model

One way that I can dramatically improve performance is to utilize transfer learning — that is, I can leverage an existing CNN which has been pretrained to recognize features of general image data, and adapt it for my own purposes. Keras has a number of such pretrained models available for download and use. Each is a model which has been trained on a repository of images known as imagenet which contains millions of images distributed across 1000 categories. Models trained on imagenet are typically deep CNNs with a number of fully connected output layers which have been trained to categorize the hidden features exposed by the convolution layers into 1 of those 1000 categories. I can take one of those pretrained models and simply replace the output layers with my own fully connected layers, which I can then train to categorize each input image as one of my 133 dog breeds. It is important to note here that I am no longer training a CNN at all — I am going to freeze the weights and kernels of the convolution layers, which are already trained to recognize abstract features of an image, and only train my own custom output network. This saves an immense amount of time.

There are at least two ways I can go about this. One way would be to stitch together the pretrained network and my custom network, as outlined above. Another, even simpler way, is to feed every image in my dataset through the pretrained network, and save the outputs as arrays to then feed through my network later. The benefit of the latter method is that it saves computing time, because each training epoch I am only doing a forward pass and backprop through my own model, instead of the imagenet model and my model together. Conveniently, Udacity already fed all of their supplied training images through a few built-in CNNs, and provided the raw output, or bottleneck features, for me to simply read in.
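For reference, generating bottleneck features yourself only takes a few lines; the following is a sketch using VGG16 and the paths_to_tensor function from earlier (each keras application has its own preprocess_input function):

# sketch: generating VGG16 bottleneck features from the raw images
from keras.applications.vgg16 import VGG16, preprocess_input

# include_top=False drops the imagenet classification layers, keeping only the convolutional base
base_model = VGG16(weights='imagenet', include_top=False)

# one frozen forward pass per image; the outputs are the bottleneck features
train_VGG16 = base_model.predict(preprocess_input(paths_to_tensor(train_files)))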

Here I define my own fully connected network to accept bottleneck features and output 133 nodes, one for each breed. This one is for the VGG16 network, used as an example. I use different networks for my actual training which will be seen in the next section.

# train_VGG16 holds the precomputed VGG16 bottleneck features read in from disk
VGG16_model = Sequential()
VGG16_model.add(GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]))
VGG16_model.add(Dense(133, activation='softmax'))

A few things to note here:

  • I start with a GlobalAveragePooling layer — this is because the last layer of the VGG16, and in fact all of the imagenet models I test, is a convolution/pooling sequence. Global Pooling layers reduce the dimensionality of this output, and greatly reduce the training time when feeding into a Dense layer.
  • The input shape of the first layer in my network must be tailored to the model for which it is designed. I can do this by simply getting the shape of the bottleneck data. The first dimension is cut off of the bottleneck feature shape to allow keras to add a dimension for batch processing.
  • I use a softmax activation function again, for the same reasons outlined for the trivial model.

Step 3b: Train bottleneck features model

Udacity provided the bottleneck features for 4 networks: VGG19, ResNet50, InceptionV3, and Xception. The following code block reads in the bottleneck features for each model, creates a fully connected output network, and trains that network over 20 epochs. Finally, it outputs the accuracy of each model.
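A condensed sketch of that block follows; the .npz filenames assume the naming convention of the Udacity-supplied files, and in this version of keras the validation accuracy lives under the val_acc history key:

# sketch: train a small output network on each set of bottleneck features
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense
from keras.callbacks import ModelCheckpoint

for network in ['VGG19', 'Resnet50', 'InceptionV3', 'Xception']:
    # load the precomputed bottleneck features for this network
    features = np.load('bottleneck_features/Dog%sData.npz' % network)
    train, valid = features['train'], features['valid']

    model = Sequential()
    model.add(GlobalAveragePooling2D(input_shape=train.shape[1:]))
    model.add(Dense(133, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    checkpointer = ModelCheckpoint('saved_models/weights.best.%s.hdf5' % network,
                                   save_best_only=True, verbose=0)
    history = model.fit(train, train_targets, validation_data=(valid, valid_targets),
                        epochs=20, batch_size=20, callbacks=[checkpointer], verbose=0)
    print('%s best validation accuracy: %.2f%%' % (network, 100 * max(history.history['val_acc'])))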

As is evident from the last line, all 4 models performed much better than my own trivial CNN, with the Xception model achieving a validation accuracy of 85%!

Step 4b: Evaluate bottleneck feature model

The best performers — Xception and ResNet50 — both achieved remarkable validation accuracy, but digging through the logs, we can see that their accuracy on the training data was nearly 100%. This is the trademark of overfitting. It isn't too surprising: Xception has 22 million parameters and ResNet50 has 23 million, meaning both models have an enormous entropic capacity and are capable of simply memorizing the training images. To combat this, I will implement some changes to my fully connected model and retrain.

I've added a second dense layer in the hopes that the model will rely a little less on the pretrained parameters, and I've augmented both dense layers with L2 regularization and dropout; a sketch of the revised network follows. L2 regularization penalizes the network for high individual parameter weights, and dropout randomly drops network nodes during training. Both fight overfitting by forcing the network to generalize more during training. Also note that I've changed optimization strategies; in a real research environment this would be done with a grid search, which accepts lists of hyperparameters (such as optimizers with ranges of their own hyperparameters), but in the interest of time I experimented with a few by hand. Note that I've switched back to SGD — through experimentation I've found that although Adam trains extremely quickly, given enough training epochs SGD consistently surpasses Adam (a finding hinted at by this paper).
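Here is that sketch, assuming the Xception bottleneck features are loaded as train_Xception; the layer width, regularization strength, and dropout rate are illustrative stand-ins for the values I tuned by hand:

# sketch: regularized output network for the Xception bottleneck features
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense, Dropout
from keras.regularizers import l2
from keras.optimizers import SGD

Xception_model = Sequential()
Xception_model.add(GlobalAveragePooling2D(input_shape=train_Xception.shape[1:]))
# second dense layer, with L2 regularization and dropout fighting overfitting
Xception_model.add(Dense(500, activation='relu', kernel_regularizer=l2(0.005)))
Xception_model.add(Dropout(0.5))
Xception_model.add(Dense(133, activation='softmax', kernel_regularizer=l2(0.005)))
Xception_model.compile(optimizer=SGD(lr=0.01, momentum=0.9),
                       loss='categorical_crossentropy', metrics=['accuracy'])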

After training for 100 epochs (about 5 minutes), the model achieved comparable validation accuracy to before, but the training accuracy is much lower. The low training accuracy is due to the dropout: the model never uses its full set of nodes when evaluating training inputs. I'm satisfied that this model is no longer overfitting to the same degree as before. It appears that the validation accuracy and loss have both roughly leveled out — it's possible that over another 100 epochs I could squeeze out another 1–2% accuracy, but there are other training techniques I can employ first.

Step 5b: Test bottleneck feature model

Almost 83% accuracy on the testing dataset. Very similar to the validation set as hoped. Looking at the confusion matrix:

Much better than the previous one. We can see here that there are a few breeds of dogs that the model performs quite well on, and a few where it really struggles. Looking at an example of this, it becomes quite clear why.

Let’s zoom in on the outlier halfway up the y axis and at about 1/4 of the x axis.
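Rather than reading the heatmap by eye, the brightest off-diagonal cell can be located programmatically, reusing y_true and y_pred from the evaluation sketch above:

# find the most common misclassification, ignoring the correct-prediction diagonal
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
np.fill_diagonal(cm, 0)
true_class, predicted_class = np.unravel_index(np.argmax(cm), cm.shape)
print('class %d is most often mistaken for class %d' % (true_class, predicted_class))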

The model consistently thinks that the 66th class is actually the 35th class. That is, it thinks that a field spaniel is actually a boykin spaniel. Here are those two breeds side by side.

Field Spaniel (left), Boykin Spaniel (right)

Notice any similarities? Clearly, distinguishing these two breeds is an incredibly difficult task. I suspect that tweaking my model parameters would not result in a meaningful improvement in classification in this case, and in a real scenario I would train a binary classifier to differentiate between these breeds, and employ it in a classification prediction pipeline if the primary model predicted either class. But for now, I’m interested in attempting to get better performance by augmenting my training data.

Step 2c: Compile augmented input model

In models trained on image data, there is a form of bootstrapping the training data called image augmentation, where during training I apply random rotations, zooms, and translations to the training images. This has the effect of artificially increasing the size of the training data by altering the pixels of the training images while maintaining the integrity of the content; e.g. if I rotate an image of a corgi by 15 degrees and flip it horizontally, the image should still be recognizable as a corgi, but the model will never have seen that exact image during training. The hope is that this technique will both improve model accuracy over a great number of epochs and fight overfitting. In order to do this, I can no longer use the bottleneck features I was using before; I must compile the entire model together for forward and back propagation. Also note that since I still do not want to edit the parameters of the imagenet pretrained model, I will freeze those layers before training.

First, I define my dataloading pipeline:
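A sketch of that pipeline, built on keras's ImageDataGenerator; the specific transformation ranges here are illustrative:

# sketch: augmented training generator and un-augmented validation generator
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255,         # same 0-1 normalization as before
                                   rotation_range=15,      # random rotations up to 15 degrees
                                   width_shift_range=0.1,  # random horizontal translations
                                   height_shift_range=0.1, # random vertical translations
                                   zoom_range=0.1,
                                   horizontal_flip=True)
valid_datagen = ImageDataGenerator(rescale=1./255)         # no augmentation for validation

train_generator = train_datagen.flow_from_directory(
    'dogImages/train', target_size=(224, 224), batch_size=20)
valid_generator = valid_datagen.flow_from_directory(
    'dogImages/valid', target_size=(224, 224), batch_size=20)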

Next, I load the imagenet model, define a custom fully connected output model, and combine them into one sequential model. I switched back to using Adam here due to the time involved in training. If I had more computational resources, it would probably be worth using SGD as before.
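A sketch of that combined model, using VGG16 as the illustrative base and a custom top similar to the regularized network from the previous section:

# sketch: frozen pretrained base stacked under a trainable output network
from keras.applications.vgg16 import VGG16
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense, Dropout

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base_model.layers:
    layer.trainable = False  # freeze the pretrained convolutional layers

model = Sequential()
model.add(base_model)
model.add(GlobalAveragePooling2D())
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(133, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# train using the generators defined above
model.fit_generator(train_generator,
                    steps_per_epoch=len(train_files) // 20,
                    validation_data=valid_generator,
                    validation_steps=len(valid_files) // 20,
                    epochs=20)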

Step 3c: Train augmented input model

This model takes significantly longer to train than any of my previous models, simply because each forward pass has to traverse all of the nodes of the imagenet model. Each epoch takes around 3.5 minutes to train on my GPU, as opposed to the mere seconds it took with the bottleneck features model. This highlights the computational gain we got from using bottleneck features previously.

The model achieved relatively high accuracy on the training and validation datasets extremely quickly — this is thanks to my switch to the Adam optimizer. Note the training accuracy is still below the validation accuracy — this is because I am still using dropout. Another thing to notice is the high variation in validation accuracy. This can be a symptom of a high learning rate (looking at you, Adam), or a high entropic capacity (too many parameters). It does seem to level out over time, so I am not concerned.

Step 4c: Evaluate augmented input model

Looking at the classification reports for this model and the previous one, both scored 0.80 on both precision and recall during validation. Both also ended up with validation loss around 1, further suggesting there was no improvement. I was hoping to see a jump in accuracy due to the augmented training data, but I think it would simply require an order of magnitude more training epochs before that improvement became evident. I suspect that switching to the SGD optimizer and running for many more epochs would help.

Step 5c: Test augmented input model

I can feed the augmented model testing data in the same way that I fed it training and validation data, with a keras ImageDataGenerator:
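A sketch of that evaluation; shuffle=False keeps the generator's labels aligned with the model's predictions:

# sketch: evaluate the combined model on the held-out test directory
from keras.preprocessing.image import ImageDataGenerator

test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(
    'dogImages/test', target_size=(224, 224), batch_size=20, shuffle=False)

loss, accuracy = model.evaluate_generator(test_generator, steps=len(test_files) // 20)
print('Test accuracy: %.4f%%' % (100 * accuracy))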

The test accuracy is in the same ballpark as the bottleneck features model, albeit a few percent lower.

To drill into precision and recall, I perform the same analysis as before, with a confusion matrix:

What’s really interesting here is that we see the same outlier as before, though it is dimmer, suggesting this model did slightly better differentiating between the spaniels. However, there’s another bright outlier near the center of the matrix now — zooming in reveals that this model can’t differentiate between a Doberman Pinscher and a German Pinscher.

Doberman (left) and German (right) Pinschers

Go figure.

End Product

The last step for me is to write a function that loads a saved model from scratch, accepts image data as input, and outputs a breed prediction. I will proceed with my augmented image model, because I believe I can improve it in the future with more training.

During the training of each of the models I tested here, I saved the parameters to .hdf5 files. Therefore, once training is complete, given that I know how to compile the model I am interested in using, I can load the best weights from my last training run on command. Then in my prediction function, I simply need to recreate the image processing steps I performed during training.
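A sketch of that prediction function; build_augmented_model and the file paths here are hypothetical stand-ins for the architecture builder and checkpoint file from my final training run:

# sketch: end-to-end breed prediction from a single image file
import numpy as np

def predict_breed(img_path, weights_path='saved_models/weights.best.augmented.hdf5'):
    model = build_augmented_model()   # hypothetical helper that recreates the architecture
    model.load_weights(weights_path)  # load the best parameters saved during training
    # recreate the training-time preprocessing: resize, tensorize, normalize
    tensor = path_to_tensor(img_path).astype('float32') / 255
    return dog_names[np.argmax(model.predict(tensor))]

print(predict_breed('my_dog_photo.jpg'))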

Since I have stored the model weights in an external file and I know how to recreate the model architecture, I can package the model for use anywhere, including in web or mobile applications. In fact, at the time of writing, there is a dog breed identification app on the Play Store which I suspect utilizes a model similar to the one I ended up with here.
