This Model is for the Birds

Deep Learning Experiments with Cornell’s Bird Data

Daniel Morton
Towards Data Science


Computer vision has advanced markedly in the last decade, largely through improvements in the artificial neural networks popularly known as deep learning. A combination of technological advances and more sophisticated network designs has led to neural networks achieving state-of-the-art, and often better-than-human, accuracy in problems of classification, object detection, and segmentation.

Traditional benchmarks for deep learning architectures are usually done on the same half-dozen or so image data sets. This has the obvious advantage of allowing for apples-to-apples comparisons, but has the disadvantage that all benchmark models are trained on the same small set of problems, which leaves open the question of how well these benchmarks generalize. Apart from being a necessarily limited sample of images, the training and validation images have all had to be curated and may not reflect the sort of images found in the real world.

I prefer to go out into the wild. Literally. If you’re going to learn a new technology, and this is my first project with Deep Learning, it’s best to find an interesting application. I couldn’t think of an application more interesting than species detection. Working with images from one class of organisms has some special challenges. In recent years, a popular collection of fine-grained image classification data sets has emerged, focusing on classification problems where detecting subtle differences is required to correctly identify images. Identifying bird species is a perfect example. The exact number is subject to debate, but the number of bird species is somewhere in the neighborhood of 10,000, with almost two thirds of those being passerines (i.e., songbirds). For all that variety, there is a remarkable uniformity in body type, with a number of species differing only in slight color patterns on the wings or head. Is that a Savannah Sparrow or a Vesper Sparrow? It’s not always easy to tell.

Vesper Sparrow (Ryan Schain/ Macaulay Library at the Cornell Lab ML41643251) and Savannah Sparrow (Don Blecha/ Macaulay Library at the Cornell Lab ML55502961).

How about the Downy and Hairy Woodpeckers? Without a good look at the beak, even experienced birders can be confused.

Downy Woodpecker (Evan Lipton/ Macaulay Library at the Cornell Lab ML47227441) and Hairy Woodpecker (Nate Brown/ Macaulay Library at the Cornell Lab ML52739271).

In the interest of contributing to research in fine-grained vision classification, the Cornell Lab of Ornithology has released the NABirds data set consisting of 48,562 images of 404 bird species. Many of these species are further subdivided into categories such as Male/Female, Adult/Juvenile, Breeding/Non-breeding, which leads to 555 total classes. My goal is to determine how I can get the best prediction accuracy on these 555 classes with different model architectures.

The data is roughly evenly split between training and validation (23,929 training and 24,633 validation). The number of training images per class is quite variable, with 60 images for about a quarter of the classes, and single digits for some of the rarer varieties (4 for the white-winged dark-eyed junco and 5 for the female harlequin duck).

Distribution of training examples.

The distribution of validation images is similar to that for the training images. Classes with few training images tend to be infrequent visitors to the validation set.

Image size is also quite variable, and will be an issue later on. More than half have a width of 1024, and most of those have heights between 600 and 900 (and most of those around 700 to 800). Mini-batch training for deep learning requires that all inputs have the same dimension. I will experiment with different cropping and resizing strategies.

Each image also comes with a bounding box covering the portion of the image containing the bird. There is one bird per image and thus one box per image. I won’t try to predict bounding boxes as part of this exercise, but I will use them in preprocessing for training and for some of the validation strategies.

Bird Images with Bounding Boxes. From the NABirds DataSet.

Model Selection

This article will focus on some of the simpler techniques for getting good accuracy on a fine-grained classification problem. The first, and most obvious, is image resolution. By their very nature, fine-grained classification problems depend on detecting subtle details that could be lost if the image resolution is decreased too much. I used three input resolutions: 224x224, 300x300, and 600x600. These are the default resolutions for EfficientNetB0, EfficientNetB3, and EfficientNetB7, the three model architectures I decided to use for my experiment.

The second factor I considered was model size. EfficientNet is a family of models based on one architecture, with depth (number of layers), width (number of channels per layer), and input resolution scaled uniformly. The original EfficientNet paper proposes eight networks; in the interest of time I only used three: small, medium, and large. (Before anyone asks why I used B3 instead of B4 as the medium model, it was a coin toss.) All models were pre-trained on ImageNet.
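For illustration, here is a minimal sketch (my own, not the exact training code) of how one of these pretrained backbones can be paired with its default resolution and a 555-class softmax head in TensorFlow 2. The pooling, dropout rate, and head structure are assumptions, not a description of the models actually trained, and the Keras EfficientNet applications used below require a reasonably recent TensorFlow release.

```python
import tensorflow as tf

# Resolution / architecture pairings described above.
CONFIGS = {
    "b0": (tf.keras.applications.EfficientNetB0, 224),
    "b3": (tf.keras.applications.EfficientNetB3, 300),
    "b7": (tf.keras.applications.EfficientNetB7, 600),
}

def build_model(size="b0", num_classes=555):
    """Assemble an ImageNet-pretrained EfficientNet backbone with a softmax head."""
    backbone_fn, resolution = CONFIGS[size]
    backbone = backbone_fn(include_top=False, weights="imagenet",
                           input_shape=(resolution, resolution, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    x = tf.keras.layers.Dropout(0.2)(x)   # assumed dropout rate, for illustration only
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(backbone.input, outputs)
```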

The third parameter that I explored is the learning rate. I do not pretend to have made an exhaustive search of the space of possible learning rates, if such a thing is even possible, but I think I have some reasonable insights. Sticking to some variations on a common theme, I started with a learning rate of 1e-3 and decreased it by a factor of 0.94 either every 4 or every 8 epochs. Dropping every 4 epochs is a standard procedure for fine-tuning networks; dropping every 8 is an adaptation to account for the larger batch sizes, and thus fewer steps per epoch, permitted by using a TPU.
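In Keras, this kind of schedule can be expressed as a staircase exponential decay. The sketch below is one way to do it, with the steps-per-epoch figure derived from the training set size and the batch size discussed later; it is a reconstruction of the rule described above, not the exact training code.

```python
import tensorflow as tf

steps_per_epoch = 23929 // 512      # training images / global batch size
epochs_per_decay = 4                # or 8 for the slower schedule

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=epochs_per_decay * steps_per_epoch,
    decay_rate=0.94,
    staircase=True)                 # drop in discrete steps rather than continuously

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```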

Preprocessing Training Data

The basic idea of machine learning is that with a representative set of training data and a model with tunable parameters, the training data can be used to find a set of parameters that allow the model to make accurate predictions when given a new set of data. In the case of image classification, the word “representative” is tricky. The space of all possible images, even on one particular subject, is so large that it is nearly impossible to account for all possibilities. There are also practical considerations. Any input to a neural network has to be at some resolution, the tensor framework for mini-batch training requires a fixed input resolution, and memory considerations put a limit on how fine the resolution can be.

The good news is that these restrictions still allow for a lot of flexibility. Data augmentation can artificially increase the size of the training data. In the case of data sets like NABirds, where input image sizes vary, resizing and rescaling algorithms are now standard. And computing power, especially with TPUs, is now such that relatively fine resolutions can be used as inputs.

It should also be added that much information can be retained when going from high to low resolutions. In these woodpecker images, the resolution is successively decreased by a factor of two. The nature of the bird, if not its exact species, remains clear in the first four images. Even in image five, woodpecker remains a reasonable guess. Only at the lowest resolution is the viewer reduced to speculation.

One reason this works is, of course, that the woodpecker is in the foreground. If the bird occupied a smaller part of the picture, recognizability would deteriorate faster. The Parula below is only visible in the first four images.

From the NABirds Dataset

There are a number of preprocessing schemes that have become standard in deep learning. Before switching to EfficientNet, I had been working with Inception architectures and stuck with a variation of the Inception preprocessing scheme. Once an image is loaded, a random crop is taken. This crop must obey the following conditions:

  • It must contain at least 50% of the bounding box. This is to avoid crops that don’t have much, if any, bird in them.
  • It must contain at least 10% of the total image.
  • It must have an aspect ratio between 4:3 and 3:4.

The crop is then resized to fit the input dimensions, one of 224x224, 300x300, or 600x600 depending on the model. Since the original crop is likely not a perfect square, this will lead to some squeeze or stretch distortion, but only enough to introduce some variety into the input.
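TensorFlow has a helper, tf.image.sample_distorted_bounding_box, that implements exactly this kind of constrained random crop. A rough sketch, assuming the bird’s bounding box is given in normalized [y_min, x_min, y_max, x_max] coordinates, might look like the following; this is my reconstruction of the idea rather than the exact pipeline code.

```python
import tensorflow as tf

def random_crop_and_resize(image, bbox, target_size):
    """Inception-style random crop constrained by the bird's bounding box."""
    # bbox: normalized [y_min, x_min, y_max, x_max], shaped [1, 1, 4].
    begin, size, _ = tf.image.sample_distorted_bounding_box(
        tf.shape(image), bounding_boxes=bbox,
        min_object_covered=0.5,              # keep at least 50% of the bird box
        area_range=(0.1, 1.0),               # crop at least 10% of the full image
        aspect_ratio_range=(0.75, 1.333),    # between 3:4 and 4:3
        use_image_if_no_bounding_boxes=True)
    crop = tf.slice(image, begin, size)
    # Resizing the (generally non-square) crop adds a little squeeze/stretch distortion.
    return tf.image.resize(crop, [target_size, target_size])
```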

The next preprocessing stage takes this square and performs a series of random color adjustments, changing hue, brightness, saturation, and contrast. For the most part, this could be seen as adjusting the image for different lighting conditions. The image also gets flipped horizontally with probability 0.5.
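These adjustments map directly onto TensorFlow’s random image ops. The ranges below are illustrative assumptions, not the exact values used in this project.

```python
import tensorflow as tf

def color_augment(image):
    """Random lighting/color jitter followed by a random horizontal flip."""
    image = tf.image.random_hue(image, max_delta=0.05)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return tf.image.random_flip_left_right(image)
```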

A sample of images after the first two stages of preprocessing is below.

Originals from the NABirds Dataset

The final preprocessing step normalizes the input values for EfficientNet.

The response variable is also preprocessed. As a modest defense against overfitting, I apply the label smoothing procedure as described in the InceptionV3 paper.
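In Keras this amounts to a single argument on the loss function; the sketch below uses the 0.1 smoothing factor from the InceptionV3 paper.

```python
import tensorflow as tf

# Label smoothing mixes each one-hot target with a uniform distribution
# over the 555 classes, so the model is never pushed toward full certainty.
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
```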

Preprocessing Validation Data

A dirty little secret of image classification models is that their validation accuracy is very dependent on how the validation set is chosen. I once had a model’s validation accuracy decrease from 90% to 70% just because I had forgotten to change the input resolution. Incidentally, that was the inspiration for this project.

In this experiment, I chose four separate validation preprocessing strategies. The main one takes the validation images and rescales them to fit the size of the training images. This is the standard method of evaluating validation sets, with two exceptions: I skip the usual center crop (I saw no virtue in simply removing part of the image), and I pad with black borders rather than squeeze and stretch the image to fit a square input.
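TensorFlow’s resize_with_pad does exactly this: it scales the image to fit inside the target square while preserving the aspect ratio, then pads the remainder with black. A minimal sketch:

```python
import tensorflow as tf

def preprocess_validation(image, target_size):
    """Rescale to fit the training resolution, padding with black instead of distorting."""
    return tf.image.resize_with_pad(image, target_size, target_size)
```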

The other three strategies, and their rationales, are below:

  • Rescale the portion of the image in the bounding box to fit the size of the training images. This strategy gives me images that are closer to the training set, but is something of a cheat since it requires foreknowledge of the bounding boxes, which I would rarely have in a real-world application.
  • Use the raw image with no preprocessing except normalizing the input. In some respects, this is the ideal situation; pass any image in and the model produces an output. Since the images are of varying size, I can only use a batch size of 1, and I can only get away with that because the validation set is small.
  • Use the raw image inside the bounding box with no preprocessing except normalization. As an avid user of iNaturalist, I know that cropping the picture around the organism of interest is often the best way to get a good ID. As we will see, removing extraneous detail from the image, even without any change in resolution, has a strong effect on accuracy. (A sketch of the bounding-box crop appears after this list.)
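A minimal sketch of that bounding-box crop is below, assuming the boxes are given as pixel (x, y, width, height) values. From there the crop is either resized to the training resolution (the first strategy above) or fed through the model as-is with a batch size of 1 (the third strategy).

```python
import tensorflow as tf

def crop_to_bird(image, bbox):
    """Crop to the annotated box; bbox assumed to be (x, y, width, height) in pixels."""
    x, y, width, height = bbox
    return tf.image.crop_to_bounding_box(image, offset_height=y, offset_width=x,
                                         target_height=height, target_width=width)

# First strategy: rescale the crop to the training resolution (224 is illustrative).
# cropped = tf.image.resize(crop_to_bird(image, bbox), [224, 224])
# Third strategy: feed the raw crop through the model one image at a time.
```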

A sample of the four types of validation input (not to scale) is below.

Originals from the NABirds Dataset

Model Training

Training was done using Google Colab with a TPU backend. The standard mini-batch size used was 512 (64 * 8 cores), although I dropped that by factors of two (i.e. 256, 128, even 64) in cases where the image size or the neural network was too large to allow a 512 batch to fit into memory. This usually happened when dealing with 600x600 images and/or EfficientNetB7. All models were trained for 300 epochs, although I think the larger models still had room for improvement. The optimizer was Adam with default parameters, and the loss function was standard categorical cross-entropy.
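As a rough sketch of the setup (the exact training script isn’t reproduced here), the TPU connection and compile step look something like the following, reusing the model and learning rate schedule sketched earlier and assuming train_ds and val_ds are tf.data pipelines built with the preprocessing above.

```python
import tensorflow as tf

# Connect to the Colab TPU and build a distribution strategy over its 8 cores.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = build_model("b0")   # helper sketched in the Model Selection section
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
        loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1),
        metrics=["accuracy"])

# Global batch of 512 = 64 examples per core across the TPU's 8 cores.
model.fit(train_ds, validation_data=val_ds, epochs=300)
```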

Results on Standard Validation Data

Validation Results for Models with Learning Rate Decay every 4 Epochs.
Validation Results for Models with Learning Rate Decay every 8 Epochs.

The broad trajectory of the results is as expected. The higher the input resolution, the better the model does. The larger the model, the better the model does. The first interesting result is that resolution is more important than model size. An increase from 224x224 to 300x300 is usually worth about 6 percentage points of validation accuracy and an increase from 300x300 to 600x600 is usually worth about 8 percentage points. This is quite consistent over all the model sizes.

Conversely, keeping the resolution fixed but increasing the model size is only worth about a 3–5 percentage point increase. This is consistent among all input resolutions.

My theory that a slower learning rate decay (every 8 epochs instead of every 4) would work better with larger mini-batch sizes was not correct. The slower decay rate always resulted in slightly lower accuracy, usually by about 0.5 percentage points.

Results on Other Validation Data

As expected, cropping has a big impact on validation accuracy. Using only the portion of the image inside the bounding box, rescaled to match the training input, as opposed to the full image, buys as much as 14 points of accuracy, and usually at least 9–10. The effect is most dramatic at 224x224 resolution and for smaller models, but is still significant even for the larger models and higher resolutions.

Global Average Pooling (Alexis Cook)
Standard vs Bounding Box Cropping

To understand why accuracy consistently increases after cropping, it is instructive to consider the last few stages of a modern deep learning model. These are typically global average pooling, optional batch normalization and dropout layers, and a final dense output layer. The input to the global average pooling layer is an HxWxD tensor (H and W are the height and width, and D is the number of channels), and the output is a D-length vector computed by averaging each channel over all HxW spatial positions. Arguably, each of the HxW D-length vectors that make up the input tensor is the final layer of a convolutional network that processes a subset of the original image. Indeed, this is the basic idea of object detection; a classification model is run on each D-length vector to determine if there is an object. In this case, each D-length vector encodes the likelihood of each bird species appearing in the part of the image that feeds into it. If a bird takes up most of the image, as it does when cropped input is used, most of the D-length vectors will encode that, and thus their average will encode it as well. If the bird is in a small portion of the image, most of the D-length vectors will be inconclusive, and the final average after global pooling will be similarly confused.
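To make the shapes concrete, here is a tiny example of what the pooling step does numerically, using a fake feature map roughly the size EfficientNetB0 produces at 224x224 input.

```python
import tensorflow as tf

# A fake feature map: batch of 1, 7x7 spatial grid, 1280 channels.
features = tf.random.normal([1, 7, 7, 1280])

# Global average pooling averages each channel over all 7x7 = 49 spatial
# positions, collapsing H x W x D down to a single D-length vector.
pooled = tf.keras.layers.GlobalAveragePooling2D()(features)
print(pooled.shape)   # (1, 1280)

# Equivalent by hand: take the mean over the height and width axes.
same = tf.reduce_mean(features, axis=[1, 2])
```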

Standard vs No Preprocessing Accuracy.

If, instead of using cropped images, I use the original image with no resizing or rescaling, accuracy goes down, often by the same amount it went up in the previous example. The only exception is models trained at 600x600 resolution. Since this resolution is closest to the original image size, the training and validation sets are most alike in this situation. It’s possible that the increase in accuracy over the standard model is a result of the images not being resized to fit a predetermined input. The decrease in accuracy for the other models can be attributed to the same averaging procedure described above.

The final case to consider is where the cropped image inside the bounding box is submitted to the model with no resizing. For the most part, this is more accurate than the standard method of submitting images, but less so than cropping and resizing. The one glaring exception is the largest model trained at the highest resolution, where accuracy drops by double digits. My best guess as to why that happened is that, since the model was both pre-trained and trained on high-resolution images, the small images inside the bounding boxes confused it. Even when the model is correct, the softmax output rarely has high confidence.

Conclusion

What conclusions can we draw from this experiment? The obvious ones are that larger models outperform smaller ones, and that higher image resolution is better than lower resolution. Neither conclusion is surprising, but the fact that image resolution has more of an impact on accuracy than model size is worth noting. This means that high accuracy with a small model like EfficientNetB0 is possible if the training input is chosen intelligently. And if small, accurate models can be trained, they can be put into production.

The more significant conclusion is that choosing how to preprocess validation input is important. Ideally, the validation input should match the types of input expected in the eventual production application. The need to batch inputs in training and validation may be a limiting factor, but staying as close to production input as possible should be the goal. This may or may not inform how training data is selected.

Further Work

The results above still leave room for improvement. As of this writing, I am experimenting with Noisy Student weights as a replacement for the ImageNet weights. So far this seems to buy me a 1–2% increase in accuracy, with larger models improving by a wider margin.

The original EfficientNet models used AutoAugment to preprocess training input. This suggests that my choice of Inception-style preprocessing, although easier to implement, may not have been ideal. I am currently working on translating one implementation of AutoAugment, and RandAugment as well, to TensorFlow 2, so I can easily insert it into my training pipeline. I would expect a modest boost in accuracy.

NABirds comes with bounding boxes, which I have so far used only indirectly. An explicit bounding box detector would have obvious value. With the recent release of EfficientDet, a family of object detection architectures using EfficientNet as a backbone, an experiment similar to this one, but evaluating both detection and localization, would be valuable.

References

[1] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In CVPR, 2015.

[2] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.

[3] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
