This is a sequel to my previous post about image classification using the NABirds data set provided by the Cornell Lab of Ornithology. In this article, I move beyond simple classification and consider object detection, which might be better termed image or object localization. Not only am I trying to determine whether a certain object is in a given picture (the detection component), but I am also trying to figure out where in the image that object happens to be (the localization component).
My own opinions notwithstanding, object detection is the term used universally in the literature and I will adhere to convention.
A detailed discussion of the NABirds data can be found in the earlier article; I will merely review the highlights here. There are 48,562 images, more or less evenly split between training and validation sets, of 404 species of birds. Apart from the images, there is a significant amount of metadata, including several levels of taxonomy and annotations for male/female, juvenile/adult, and other color morphs where appropriate. The end result is 555 categories at the finest level. At the coarsest level, which will be of more interest here, there are 22 categories. At no level is the data even close to being balanced; some categories are very common and others are exceedingly rare.
The goal of this project is twofold. First, I simply want to predict a bounding box containing a bird in the image. There is only one detection class, bird, and the model has to find where the avian is in the picture. The second task is to locate the bird and predict which of the 22 top-level categories, which roughly correspond to orders in taxonomy, the bird belongs to. The categories, with image counts, are listed below in alphabetical order by common name. Note that perching birds, which account for more than half of all bird species, are by far the largest group. Astute naturalists will also notice that one order, Charadriiformes (i.e. shorebirds), has been divided into several categories.
Bounding Boxes
Each image has only one labeled bird and one bounding box containing it. A bounding box is a small rectangle containing the object the model is searching for and as little else as possible. There are three ways to label bounding boxes, all involving four coordinates. The simplest is to use the coordinates of the upper left and lower right corners of the box. (By convention, the upper left corner of the image is (0, 0) and the lower right is (width, height).) Alternatively, the lower right corner can be replaced by the width and height of the box. A third, and more numerically stable, system is to use the width and height, but replace the upper left corner with the center of the box. All three methods are used by object detection networks as inputs, but most convert to the third system when training the neural net.

The above picture is a good example. The bounding box completely encloses the sharp-shinned hawk except for part of its tail (tails are tricky when bounding animals). The box is almost as tight as possible while still enclosing the whole animal. In our three notation schemes the box would be encoded as follows:
- upper left, lower right = (141, 115), (519, 646).
- upper left, width, height = (141, 115), 378, 531.
- center, width, height = (330, 380.5), 378, 531.
In practice, the coordinates are usually scaled to be a fraction of the width and the height (i.e. between 0 and 1). It is also common to use a log-scale transform of the height and width of the boxes, since small errors in predicting the size of large boxes are less serious than small errors in predicting the size of small boxes.
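To make the three encodings and the scaling concrete, here is a small sketch of the conversions; the image dimensions are made up purely for illustration.

```python
import numpy as np

def corners_to_scaled_center(box, img_w, img_h):
    """(x1, y1, x2, y2) pixel corners -> scaled (cx, cy, log w, log h)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    # Scale to fractions of the image, log-transform the box size.
    return np.array([cx / img_w, cy / img_h, np.log(w / img_w), np.log(h / img_h)])

def scaled_center_to_corners(box, img_w, img_h):
    """Invert the transform back to pixel corners."""
    cx, cy, log_w, log_h = box
    w, h = np.exp(log_w) * img_w, np.exp(log_h) * img_h
    cx, cy = cx * img_w, cy * img_h
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

# The hawk example above, pretending the image is 1024x768.
print(corners_to_scaled_center((141, 115, 519, 646), 1024, 768))
```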
In principle, an image can contain any number of bounding boxes enclosing any number of different categories of object. I’m fortunate that each image in NABirds has only one box. It has little effect on the training, but will make evaluation of the model somewhat simpler.
Object Detection Networks
Neural networks for object detection build on the basic architecture of neural networks for image classification. Image classification networks consist of an amazingly complex combination of convolutional layers, skip connections, inception layers (a sort of fractal layers-within-layers architecture), pooling layers, batch normalization, dropout, and squeeze-and-excitation components, but the final layers are simple. A global average pooling (GAP) layer collapses the final tensor to a single feature vector, optional batch normalization and dropout layers provide regularization, and a final dense layer, usually with softmax activation, provides the final classification. Arguably, a neural network for image classification is simply a complicated feature extractor followed by logistic regression, with the GAP layer marking the transition point. Most object detection networks work by removing the top of the network, starting with the GAP layer, and replacing it with something a bit more complicated.
That something involves anchor boxes. Each of these boxes is centered at some point in the image and has height and width some fraction of the image. Normally, there is a grid of anchor boxes of a given size that covers the entire image. A full set of anchor boxes is a collection of grids of varying resolutions; the larger the boxes, the fewer the grid points. An anchor box is a positive anchor for a given class if its overlap with an input bounding box for that class is large enough. If the overlap is too small the anchor is a negative anchor, and anything between these bounds is ignored. The typical measure for this overlap is IoU (Intersection over Union), which is exactly what it sounds like: the area of the overlap of the two boxes divided by the area of their union. Typical bounds are IoU > 0.7 for positive and IoU < 0.3 for negative.
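A minimal sketch of the IoU computation and the anchor-labeling rule, using the thresholds quoted above:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_box, pos_thresh=0.7, neg_thresh=0.3):
    """Return +1 (positive), 0 (negative), or -1 (ignored) for one anchor."""
    overlap = iou(anchor, gt_box)
    if overlap >= pos_thresh:
        return 1
    if overlap <= neg_thresh:
        return 0
    return -1
```

When an image has several ground truth boxes, each anchor is labeled against the box it overlaps most; with NABirds' single box per image that complication never arises.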

Before GAP is performed, the final output tensor could be seen as a grid of feature vectors, each one predicting the presence of an object in a small portion of the image. A very simple (a little too simple) object detection model could then just attach a classification layer to each vector in the grid as well as a second layer that outputs four box coordinates. The model could then be trained with a loss function that is the sum of the softmax classifier and an MSE regression loss on the box coordinates (usually the center, width, and height for numerical stability).
As I said, that’s a little too simple, but not by much. The model described above would correspond to the coarsest level of anchor boxes, since the final tensor before GAP has a much smaller width and height than the original image. In the case of the old VGG16 network, a 224×224 image leaves a 7×7×512 tensor before final processing. I’d only be able to detect 49 possible bounding boxes, one for each of the 32×32 sub-images made by partitioning the original into a 7×7 grid. In practice, I need to be able to search for thousands of possible box centers, at several resolutions.
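For concreteness, here is roughly what that naive grid detector would look like in PyTorch, attached to a 7×7×512 feature map. This is a sketch of the idea, not anyone’s production head; the channel and class counts are placeholders.

```python
import torch
from torch import nn

class NaiveGridHead(nn.Module):
    """Attach a classifier and a box regressor to every cell of the final
    7x7 feature grid -- the 'too simple' detector described above."""
    def __init__(self, in_channels=512, num_classes=22):
        super().__init__()
        self.cls_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.box_head = nn.Conv2d(in_channels, 4, kernel_size=1)  # cx, cy, w, h

    def forward(self, features):                  # features: (batch, 512, 7, 7)
        class_logits = self.cls_head(features)    # (batch, num_classes, 7, 7)
        box_coords = self.box_head(features)      # (batch, 4, 7, 7)
        return class_logits, box_coords
```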

I can introduce finer resolutions by using some of the network’s earlier layers. In this case I have 14×14 and 28×28 grids available to me. Since they’re earlier in the network, the features contain less information, but I can get around that by upsampling the later features and adding the results back in. Each layer would then correspond to a grid of anchor boxes. This is still a little too simple for modern object detection models, which learn complicated relationships between the various resolution layers. In the finest tradition of neural networks, the layers that combine the resolution layers can themselves be stacked together to form richer features.
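A bare-bones sketch of that upsample-and-add idea; the channel counts here are illustrative, and real FPN and BiFPN layers are considerably richer.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Merge 7x7, 14x14, and 28x28 backbone features by upsampling the
    coarser maps and adding them to the finer ones."""
    def __init__(self, channels=(512, 256, 128), out_channels=64):
        super().__init__()
        # 1x1 convs bring every level to a common channel count first.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in channels])

    def forward(self, c7, c14, c28):
        p7 = self.lateral[0](c7)                                   # coarsest level
        p14 = self.lateral[1](c14) + F.interpolate(p7, scale_factor=2)
        p28 = self.lateral[2](c28) + F.interpolate(p14, scale_factor=2)
        return p7, p14, p28      # one grid of anchor boxes per resolution
```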

There are three more significant, although much simpler, changes that must be made before I have a good object detection network. First, softmax must be replaced by sigmoid at all classification points. Softmax does not really handle the absence of any object, which will be the most common output at most anchor points, so the presence or absence of each class must be predicted independently. Second, the classification and regression losses are on different scales, so a weighting factor must be added in so that both responses train well. Depending on the model, class_loss + 10*box_loss or class_loss + 50*box_loss will work well. The third, and possibly simplest, addition is to have each anchor box center correspond to multiple anchor boxes of different aspect ratios.
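Putting the first two changes together, the combined loss might look like the sketch below; the box weight and the choice of MSE for the regression term are just the options mentioned above, and real detectors often swap in focal and smooth-L1 losses.

```python
import torch
import torch.nn.functional as F

def detection_loss(class_logits, box_preds, class_targets, box_targets,
                   positive_mask, box_weight=10.0):
    """Sigmoid (per-class) classification loss plus a weighted box loss."""
    # Each class is predicted independently, so 'no object' is simply an
    # all-zeros target vector; targets must be floats for BCE.
    cls_loss = F.binary_cross_entropy_with_logits(class_logits, class_targets)

    # Regress box coordinates only on the positive anchors.
    if positive_mask.any():
        box_loss = F.mse_loss(box_preds[positive_mask], box_targets[positive_mask])
    else:
        box_loss = box_preds.sum() * 0.0   # keep the graph valid with no positives

    return cls_loss + box_weight * box_loss
```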
EfficientDet
The basic premise of the EfficientNet series of networks is that a simple relationship between three major network parameters (number of layers, image resolution, and filters per layer) can be used to easily generate a family of image classification models of various sizes, ranging from small models suitable for smartphones to the larger models necessary for state-of-the-art accuracy. EfficientDet extends the same principle to object detection models. The basic EfficientNet backbones are used as feature extractors in the manner described above. Instead of a GAP layer at the end, the different resolution levels are further processed by a series of bidirectional feature pyramid network (BiFPN) layers. The number of BiFPN layers, and the number of channels per layer, scales up with the size of the backbone network. At the end are a few more convolution layers followed by classification and box prediction.
Since I used EfficientNet for my previous project, EfficientDet seemed a natural choice for this one. That came with a few challenges. When I started working, the only viable option for running EfficientDet on Colab was based on PyTorch. (Tensorflow has since released its own implementation.) Up until now, all my work had been done in Tensorflow. Learning to work in a new environment is always a challenge, but I didn’t mind that. The real downside to working in PyTorch was that I couldn’t use a TPU on Colab. Having recourse only to the GPU, I was forced to scale down the number of different models I could compare. To date, I have only run D0 through D2, each with the default input size. Object detection usually uses larger images than simple classification, which means batch sizes must be smaller, and thus learning rates must be smaller. I could only use a batch size of 4 for these models. The initial learning rate was set at 0.0002 and dropped by half after each epoch in which the validation loss didn’t improve. I gave the model the option of training up to 100 epochs, but usually found I could stop by epoch 50.
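That learning rate schedule maps directly onto PyTorch’s built-in plateau scheduler. A sketch of the setup is below; the choice of AdamW is my assumption, since the optimizer itself isn’t specified above.

```python
import torch

def make_optimizer(model):
    """LR 2e-4, halved after any epoch in which the validation loss fails
    to improve (patience=0). AdamW is an assumption on my part."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=0)
    return optimizer, scheduler

# Inside the epoch loop, after computing the validation loss:
#     scheduler.step(val_loss)   # halves the LR whenever val_loss stalls
```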
Preprocessing
As with all image processing neural networks, Object Detection is very dependent on good image preprocessing. Most of the important preprocessing operations involve transformations, such as shifts and zooms, that affect the bounding box just as much as the image. Fortunately, there’s a package, albumentations, that can handle this. I’ve no idea what, if anything, albumentations are supposed to be (and it looks like the spell checker is just as stumped) but the package consists of functions that ensure any transformation applied to an image is also applied to the associated bounding box and/or image mask in the case of image segmentation problems.
NABirds images come in a variety of sizes, with an upper bound on height and width of 1024. About half have one side of length 1024 and the other somewhere between 600 and 900. Batch processing requires a fixed input size. This is easy enough for training data, since I take a random square crop out of the image anyway, but is a bit trickier for validation. Since I don’t want to lose any information by cropping the validation images, I instead rescale so that the longer dimension matches the input size and then pad the smaller dimension until the image is a square.

The set of transforms I use on the training data is listed below.
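Here is a sketch of that pipeline in albumentations; the exact parameter values (hole size for Cutout, the bbox format, and so on) are my guesses wherever the description below does not pin them down.

```python
import albumentations as A

IMG_SIZE = 512  # default input size of the EfficientDet variant in use (512 for D0)

train_transforms = A.Compose(
    [
        # Random crop covering 10-100% of the image, aspect ratio 3:4 to 4:3.
        A.RandomResizedCrop(IMG_SIZE, IMG_SIZE, scale=(0.1, 1.0), ratio=(0.75, 1.333)),
        # Change hue or brightness, but not both, 90% of the time.
        A.OneOf([A.HueSaturationValue(p=1.0),
                 A.RandomBrightnessContrast(p=1.0)], p=0.9),
        A.ToGray(p=0.01),                 # greyscale 1% of the time
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Transpose(p=0.5),               # flip along the diagonal
        # Zero out one small square; the hole size here is assumed.
        A.Cutout(num_holes=1, max_h_size=64, max_w_size=64, fill_value=0, p=0.5),
    ],
    # Every geometric transform is applied to the bounding box as well.
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']),
)
```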
The first operation takes a random rectangular crop from the original image ranging in size between the full image and 10%, with an aspect ratio between 4:3 and 3:4, and reshapes it to a square of the appropriate size. I always use the default input size of whichever EfficientDet network I’m using.
Stage two changes the hue or brightness, but not both, with 90% probability (i.e. one time out of ten it does nothing). Stage three changes the image to greyscale 1% of the time. The next three operations are flips along the vertical, horizontal, and diagonal axes, each with 50% probability. Cutout will replace some small square in the image with zeros; it functions like dropout on later layers. A sample of training images produced by these transformations is below.

Eagle-eyed readers will notice that all of these images contain bounding boxes, although in one case the box is now the full image. What happens if the random crop does not contain the box at all? Very simply, I try again. The preprocessing code makes as many as 100 attempts to produce a crop with a bounding box before falling back on a simpler default, which merely performs the resize operation on the whole image.
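That retry logic only takes a few lines; something along these lines, with illustrative function and argument names:

```python
def crop_with_box(image, boxes, labels, train_transforms, fallback_transforms,
                  max_tries=100):
    """Re-run the random crop until the bounding box survives, falling back
    to a plain resize of the whole image if it never does."""
    for _ in range(max_tries):
        sample = train_transforms(image=image, bboxes=boxes, labels=labels)
        if len(sample['bboxes']) > 0:     # the crop still contains the box
            return sample
    return fallback_transforms(image=image, bboxes=boxes, labels=labels)
```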
My validation transforms are below.
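Again a sketch with albumentations; the ordering and border handling are reconstructions of the description in the text.

```python
import cv2
import albumentations as A

IMG_SIZE = 512  # default input size of the EfficientDet variant in use (512 for D0)

val_transforms = A.Compose(
    [
        # Scale the longer side down to the input size...
        A.LongestMaxSize(max_size=IMG_SIZE),
        # ...then pad the shorter side with zeros to make a square.
        A.PadIfNeeded(min_height=IMG_SIZE, min_width=IMG_SIZE,
                      border_mode=cv2.BORDER_CONSTANT, value=0),
    ],
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']),
)
```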
All this does is rescale the image and pad it with zeros into a square of the model’s input size. The end result of its handiwork is below.

Training
PyTorch does not have the Keras front end available to Tensorflow, which means training a model requires a bit more manual work. The key idea is still fairly simple: for each batch of training data, forward propagate to get the current predictions, then back propagate to update the parameters. The full Fitter class I used has too much boilerplate code for this article, but I can share the loop that trains one epoch.
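In outline it looks something like the sketch below. I’m assuming the EfficientDet wrapper computes and returns its own training loss when handed images and targets; the exact call signature depends on the implementation.

```python
def train_one_epoch(model, loader, optimizer, device):
    """One pass over the training data: forward propagate, compute the
    loss, back propagate, and update the parameters."""
    model.train()
    running_loss = 0.0
    for images, targets in loader:
        images = images.to(device)
        targets = {k: v.to(device) for k, v in targets.items()}

        optimizer.zero_grad()
        loss = model(images, targets)   # assumes the wrapper returns its training loss
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    return running_loss / len(loader)
```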
After each training epoch, a similar loop performs validation. If the new parameters are an improvement, the model is saved.
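A sketch of that validation loop, with the checkpointing at the end, under the same assumptions as above:

```python
import torch

def validate(model, loader, device, best_loss, checkpoint_path='best.pth'):
    """The same loop without gradient updates; saves the model whenever
    the validation loss improves on the best seen so far."""
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for images, targets in loader:
            images = images.to(device)
            targets = {k: v.to(device) for k, v in targets.items()}
            total_loss += model(images, targets).item()

    val_loss = total_loss / len(loader)
    if val_loss < best_loss:
        torch.save(model.state_dict(), checkpoint_path)
        best_loss = val_loss
    return val_loss, best_loss
```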
Validation
Object detection validation is not a simple matter. Unlike image classification, which produces one probability for each category, object detection produces a separate probability and bounding box for each category and anchor box in the output. That’s a lot of numbers, most of which can be easily discarded simply by thresholding: if the score for a certain category in a certain anchor is low enough, we can ditch it without a second thought. In practice, that will be the case for all categories in most of the anchor boxes and most categories in all of the anchor boxes. That still leaves the possibility of multiple bounding boxes marked as positive IDs. If the positive categories don’t match the real categories then there is an obvious misclassification, but if the categories match, it’s necessary to consider the accuracy of the bounding box as well.
As described above, object detection models rely on anchor boxes. After the model is trained, each one of these anchors has an output. Most can be ignored simply because of their low confidence score. It’s also possible that two (or more) neighboring anchors will produce positive output for the same class. In this case, the bounding boxes will be very similar and only one needs to be saved, the one with the highest confidence score. Non-max suppression is the standard algorithm for removing redundant bounding boxes.
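torchvision ships a standard non-max suppression routine, so the post-processing stays short. In the sketch below, the 0.5 confidence threshold is the one used later in this article; the NMS IoU threshold is an assumption.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, score_thresh=0.5, nms_iou=0.5):
    """Drop low-confidence anchors, then let non-max suppression keep only
    the highest-scoring box among heavily overlapping neighbors."""
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, nms_iou)      # indices of the surviving boxes
    return boxes[kept], scores[kept]
```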
Only now can I get down to the business of comparing the model’s output to the ground truth. I have something of an advantage here knowing that I have only one bounding box per image; I can evaluate by only considering the predicted bounding box with the highest confidence. The metrics used for measuring accuracy deal specifically with precision and recall so I’ll have to define true positive, false positive, and false negative for object detection.
- True Positive – Predicted category matches ground truth and IoU is greater than some threshold.
- False Positive – Predicted category matches ground truth but the IoU is below the threshold.
- False Negative – Predicted category does not match ground truth or no category is predicted.
Precision is the ratio of true positives to all predicted positives, and recall is the ratio of true positives to all ground truth positives. In layman’s terms, precision is the percentage of positive predictions that are correct and recall is the percentage of actual positives that are identified.
Object detection models are evaluated by calculating precision at various IoU thresholds (0.5 and 0.75 are popular) and by averaging the precision (the seemingly redundantly named mean average precision) over a range of IoU thresholds (usually 0.5 to 0.95 in increments of 0.05.) Similar metrics can be calculated for recall.
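With one box per image, these metrics reduce to a few lines. A sketch using the definitions above:

```python
import numpy as np

def mean_average_precision(ious, matches, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Precision = TP / (TP + FP), averaged over IoU thresholds 0.5-0.95.
    `matches` flags whether the predicted category equals the ground truth;
    `ious` holds the overlap of the predicted and ground truth boxes."""
    precisions = []
    for t in thresholds:
        true_pos = np.sum(matches & (ious >= t))
        false_pos = np.sum(matches & (ious < t))
        precisions.append(true_pos / max(true_pos + false_pos, 1))
    return float(np.mean(precisions))
```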
Single Class Detection
The simpler of my two problems is to find where in the image there is a bird and return the box containing it. I tried three different models, D0 through D2 in the EfficientDet family, using the bounding box with the highest confidence score as the prediction. I only allow confidence scores above 0.5; a few images sufficiently confuse the models that no bounding box meets that threshold, so there are a few false negatives.
False positives are more interesting. As a metric, IoU has to wear a couple different hats. It says that the predicted box must cover a certain amount of the ground truth box, but also that the ground truth box must cover a certain amount of the predicted box. A low IoU may mean that the model detected a bird in the wrong part of the image, but may also mean that the model found the bird but wasn’t precise enough (i.e. the box was too big.) More often than not, a false positive will be a case where the bird was detected, and will be in the bounding box, but the predicted coordinates were not good enough.
Intuitively, an IoU score of 0.5 might seem a little low. It’s worth looking at some examples to see what different IoU thresholds look like. Remember, images are two-dimensional, and our minds have trouble judging how area changes when more than one dimension is changing at a time. Let’s look at what IoU = 0.8 amounts to in real life.

Anyone looking at these images would probably guess that the overlap is better than 90%, but in reality it’s a little under 80% in all cases. This suggests that 0.8, which would be considered a mediocre value for a lot of metrics, is not that bad for bounding boxes. It’s also worth noting that the predicted box is often better than the "ground truth." Human annotators have their limits.
Even cases where the IoU is a measly 0.5 often look okay to these human eyes.

Both of these examples used predictions from EfficientDetD0. As I increase from D0 to D2, the accuracy improves modestly.
Using an IoU of 0.5 as the threshold, the models have near perfect precision. Only in a few cases does the predicted bounding box not overlap the ground truth adequately. Even at the higher threshold of 0.75, prediction and ground truth match at least nine times out of ten. The mean average precision, which averages precision at IoU thresholds between 0.5 and 0.95, is a bit lower since precision drops precipitously above IoU = 0.9.
Mean average recall is also nearly perfect since only a couple hundred images do not have any bounding boxes with confidence score at least 0.5.
Looking just at the metrics, the models appear to be performing well. As the model size increases accuracy increases modestly but consistently. Looking at the output, I think the models might even be doing better than the numbers suggest. In a number of cases, the predicted bounding boxes are clearly better than the human labeled ground truth. In others, the predicted boxes are too large simply because the model assumes the bird extends behind a branch, a common enough situation. In one comical case, the predicted bounding box does not overlap the ground truth because it catches the bird’s reflection (in retrospect, allowing vertical flips in training may have been unwise). In another, the predicted box finds a second bird the annotator missed. I think that if the ground truth labels had been a little cleaner, the models’ performance would look even better.

Multi Class Detection
There are 22 top-level categories in Cornell’s data set, each corresponding to an order or family of birds and each with a common name provided. Adding category detection complicates the classification (the what) component of an object detection model, but has less effect on the separate localization (the where) component. Since I’ve increased the number of positive classes from 1 to 22, I need to drop the threshold at which I consider a prediction score to be a positive match. I’m still dropping everything except the highest score (one bounding box per image), so I can take the relatively low score of 0.1 as a positive prediction.
How did I pick 0.1? Random guessing would give me just under 0.05; I doubled that, figuring that confidence scores twice random wouldn’t happen very often by chance. The results were favorable, so I stuck with it.
It should come as no surprise that accuracy varies among the classes more or less in direct proportion to the number of images in the class. Since I made no attempt to equalize the data in training, the six classes with only double-digit image counts got a grand total of two positive IDs across the three models. Meanwhile, perching birds are at 97–98%. One gratifying result is that in the middle range, orders with between 300 and 1000 images, increasing the model size could have a dramatic impact: the true positive rate could jump 10–20% (more in the case of frigatebirds). Full true positive results are below.
The standard procedure in multi-class object detection problems is to compute average precision by averaging the precision over all classes. Given the level of class imbalance, I thought it would be instructive to compute an "unweighted" average precision over the whole data set. This will, of course, give me larger numbers than the standard weighted precision. I think both data points have some value.
Remember that a false positive is when the predicted order matches the ground truth, but the IoU of the predicted and ground truth boxes is too low.
Precision is lower than it is for the single class detectors, as expected, and increases slowly as the model increases, also as expected. The weighted (i.e. standard) average precision is much lower, but also increases at a faster rate than the unweighted metric. This reflects the fact that accuracy jumps dramatically for the mid-sized classes. The distribution of IoU values is pretty similar for both sets of models; almost all the decrease in accuracy is a reflection of the model now having to distinguish between 22 orders.
Further Work
Any good experiment opens the door to new questions. In this case, I can think of several. Assuming I can keep the batch size high enough (i.e. find a way to use TPUs), how would increasing the model size beyond D2 affect accuracy? I made one experiment with D3, but was forced to cut the already small batch size in half. The end result was lower recall, with precision scores falling back to D0 levels. As I’d made no other changes, it seems the batch size was entirely responsible for that.
I’ve long suspected that larger models need lower learning rates. Would D3 and higher work better with a lower learning rate?
What happens if, instead of 22 categories, I use all 404 species, or even all 555 classes? Everything I’ve seen about object detection suggests that it does not comfortably handle as many classes as image classification. Not yet, anyway.
Perhaps the most useful test of the above algorithms would be to test them on some other set of bird data, such as the birds in the various iNaturalist data sets.
References
[1] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In CVPR, 2015.
[2] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.