Bird by Bird Tech

Bird by Bird using Deep Learning

Advancing CNN model for fine-grained classification using transfer learning, auxiliary task and attention

Sofya Lipnitskaya
Towards Data Science
Jan 3, 2021



This article demonstrates how deep learning models used for image-related tasks can be advanced to address the fine-grained classification problem. For this objective, we will walk through the following two parts. First, you will get familiar with some basic concepts of computer vision and convolutional neural networks, while the second part demonstrates how to apply this knowledge to a real-world problem of bird species classification using PyTorch. Specifically, you will learn how to build your own CNN model, ResNet-50, and how to further improve its performance using transfer learning, an auxiliary task, an attention-enhanced architecture, and even a little more.

Introducing the related work

Applications of deep learning for computer vision

Computers perform extremely well when it comes to crunching numbers. Solving tons of equations to get a human to the Moon? No problem. Determining whether a cat or a dog appears in an image? Oops… A task that is inherently easy for any human being seemed impossible for early computers. Over the years, algorithms evolved, and so did the hardware (remember Moore's law? R.I.P.). The field of computer vision emerged as an attempt to solve the task of classifying images using computers. After a long period of development, many sophisticated methods were created. However, all of them suffered from a lack of generalizability: a model built to classify cats vs. dogs couldn't distinguish, for example, birds.

Design principles of convolutional neural networks

In 1989, Yann LeCun and his colleagues proposed [1], and further developed [2], the concept of the convolutional neural network (CNN). The model itself was inspired by the human visual cortex, where a visual neuron is responsible for a small piece of the picture visible to the eye, the neuron's receptive field. Structurally, this was expressed in the way a single convolutional neuron (filter) scanned an input image step-by-step, being applied to different parts of the image many times, which refers to the concept of weight sharing (Figure 1).

Figure 1. LeCun's LeNet-5, a convolutional neural network model composed of convolution and sub-sampling operations followed by fully-connected layers that process the data extracted by the previous layers to form the final output (adapted from [2])
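To make weight sharing concrete, here is a minimal sketch (illustrative, not part of the original LeNet): a single convolutional filter carries the same handful of weights regardless of the input size, because it is reused at every spatial position.

import torch

# a single 3x3 filter: 9 shared weights + 1 bias, for any input size
conv = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 10

out_small = conv(torch.randn(1, 1, 28, 28))    # applied to a 28x28 image
out_large = conv(torch.randn(1, 1, 500, 500))  # reused, unchanged, on a 500x500 one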

Towards residual networks

Of course, since LeCun's LeNet-5, the state of the art in CNN models has advanced greatly. The first successful large-scale architecture was AlexNet [3], which won the ILSVRC 2012 challenge with a top-5 error rate of 15.3%. Later advancements produced many powerful models, improved mainly through the use of larger and more complex architectures. The catch is that as a network goes deeper (its depth increases), its performance saturates and starts degrading. To address this problem, the residual neural network (ResNet) was developed [4], which effectively routes the input past some layers (via so-called skip or residual connections).

Figure 2. Skip connection block of ResNets

The core idea of the ResNet architecture is to pass a part of the signal to the end of a convolutional block unprocessed (by simply copying values) in order to improve gradient flow through the deep layers (Figure 2). Thus, the skip connection guarantees that the performance of the model does not decrease as depth grows, while it may even increase slightly.
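In PyTorch, the idea can be sketched roughly as follows (a minimal residual block, simplified relative to the bottleneck blocks actually used in ResNet-50):

import torch
import torch.nn.functional as F

class ResidualBlock(torch.nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = torch.nn.BatchNorm2d(channels)
        self.conv2 = torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = torch.nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x  # the skip connection: keep a copy of the input
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity)  # add the unprocessed signal back

If the convolutional layers learn nothing useful, their output can be driven towards zero and the block simply passes the input through, which is exactly why extra depth no longer hurts.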

The next part explains how the theory discussed above can be applied to solving a real-world problem.

Classification of bird species using ResNet

Bird species recognition is a difficult task that challenges the visual abilities of both human experts and computers. One interesting dataset for the fine-grained classification problem is Caltech-UCSD Birds-200-2011 (CUB-200-2011) [5], consisting of 11,788 images of birds belonging to 200 species. To address this problem, the goals of the current tutorial are: (a) to build a CNN model to classify bird images w.r.t. their species and (b) to determine how the prediction accuracy of a baseline model can be boosted using CNNs of different architectures. For that, we will use PyTorch, one of the most popular open-source frameworks for deep learning.

By the end of this tutorial, you will be able to:

  • Understand the basics of the fine-grained bird species classification problem.
  • Determine a data-driven image pre-processing strategy.
  • Create your own deep learning pipeline for image classification.
  • Build, train and evaluate a ResNet-50 model to predict bird species.
  • Improve the model performance using different techniques.

First, you need to download an archive containing the dataset and store it in the data directory. The archive can be downloaded manually or by using the Python code provided in the following GitHub repository: https://github.com/slipnitskaya/Bird-by-Bird-AI-Tutorials.

Now, let’s import packages that we will use in this tutorial:

# import packages
import os
import csv

import numpy as np
import sklearn.metrics as skmt  # accuracy_score lives in sklearn.metrics
import sklearn.model_selection as skms

import torch
import torch.utils.data as td
import torch.nn.functional as F

import torchvision as tv
import torchvision.transforms.functional as TF

# define constants
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
RANDOM_SEED = 42

IN_DIR_DATA = 'data/CUB_200_2011'
IN_DIR_IMG = os.path.join(IN_DIR_DATA, 'images')

Discovering the optimal data transformation strategy

In this tutorial, we plan to use a baseline model pre-trained on the ImageNet dataset. Pre-trained models usually expect input images to be normalized in the same way, with heights and widths of at least 224 pixels. There are many ways to transform images so that they fulfill these specifications, but which one might be optimal?

Exploratory data analysis is an essential starting point of any data science project, laying the foundation for further analysis. Since we want to define the optimal data transformation strategy, we are going to explore the bird images to see what useful patterns we can spot. Let's have a look at some bird examples of the sparrow family (Figure 3). It seems there can be a high similarity between birds of different species that is really hard to spot. Is that a White-throated or a Lincoln Sparrow? Well, even experts can be confused…

Figure 3. Hard-to-distinguish birds of the same family: White-throated Sparrow vs. Lincoln Sparrow (CUB-200‑2011), vs. Captain Jack Sparrow (Public domain)

Just out of interest, we'll count all classes of the sparrow family to understand how many of them there are in our dataset:

# calculate the number of sparrow species
cls_sparrows = [k for k in os.listdir(IN_DIR_IMG) if 'sparrow' in k.lower()]
print(len(cls_sparrows))

The code above gives us the value of 21, implying that a single family alone can be represented by about two dozen different species. And now we see why CUB-200-2011 is perfectly designed for fine-grained classification: what we have are many similar-looking birds belonging to different classes, and that is exactly the problem we plan to tackle here.

But before getting into real deep learning, we want to determine an appropriate strategy for data pre-processing. For that, we will analyse the marginal distributions of widths and heights by visualizing box plots for the corresponding observations:
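The plotting code is not part of the original walkthrough, but gathering the size statistics behind Figure 4 might look roughly like this (a sketch assuming the standard CUB-200-2011 directory layout under IN_DIR_IMG):

import os
import PIL.Image
import matplotlib.pyplot as plt

# collect widths and heights of all images
widths, heights = list(), list()
for cls_dir in sorted(os.listdir(IN_DIR_IMG)):
    for fn in os.listdir(os.path.join(IN_DIR_IMG, cls_dir)):
        with PIL.Image.open(os.path.join(IN_DIR_IMG, cls_dir, fn)) as img:
            widths.append(img.width)
            heights.append(img.height)

# visualize the marginal distributions as box plots
plt.boxplot([widths, heights], labels=['width', 'height'])
plt.ylabel('size, px')
plt.show()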

Figure 4. Distributions of bird image sizes. Boxes reflect the interquartile range of the image size distribution (widths and heights), the line within each box marks the median of the data, whiskers extend to the most extreme observations beyond Q3 and Q1, respectively, and the outside points denote data lying outside the overall distribution

Indeed, the size of the images varies considerably. We also see that the heights and widths of the majority of images are equal to 375 and 500 pixels, respectively. So, what might be an appropriate transformation strategy for this kind of data?

Transforming images and splitting the data

The CUB-200-2011 dataset contains thousands of images, which can affect computational time. To handle this conveniently, we first create the class DatasetBirds to make data loading and pre-processing easy:

class DatasetBirds(tv.datasets.ImageFolder):
    """
    Wrapper for the CUB-200-2011 dataset.
    Method DatasetBirds.__getitem__() returns a tuple of an image and its corresponding label.
    """
    def __init__(self,
                 root,
                 transform=None,
                 target_transform=None,
                 loader=tv.datasets.folder.default_loader,
                 is_valid_file=None,
                 train=True,
                 bboxes=False):

        img_root = os.path.join(root, 'images')

        super(DatasetBirds, self).__init__(
            root=img_root,
            transform=None,
            target_transform=None,
            loader=loader,
            is_valid_file=is_valid_file,
        )

        self.transform_ = transform
        self.target_transform_ = target_transform
        self.train = train

        # obtain sample ids filtered by split
        path_to_splits = os.path.join(root, 'train_test_split.txt')
        indices_to_use = list()
        with open(path_to_splits, 'r') as in_file:
            for line in in_file:
                idx, use_train = line.strip('\n').split(' ', 2)
                if bool(int(use_train)) == self.train:
                    indices_to_use.append(int(idx))

        # obtain filenames of images (a list keeps file order, so the
        # bounding boxes collected below stay aligned with the images)
        path_to_index = os.path.join(root, 'images.txt')
        filenames_to_use = list()
        with open(path_to_index, 'r') as in_file:
            for line in in_file:
                idx, fn = line.strip('\n').split(' ', 2)
                if int(idx) in indices_to_use:
                    filenames_to_use.append(fn)

        img_paths_cut = {'/'.join(img_path.rsplit('/', 2)[-2:]): idx for idx, (img_path, lb) in enumerate(self.imgs)}
        imgs_to_use = [self.imgs[img_paths_cut[fn]] for fn in filenames_to_use]

        _, targets_to_use = list(zip(*imgs_to_use))

        self.imgs = self.samples = imgs_to_use
        self.targets = targets_to_use

        if bboxes:
            # get coordinates of the bounding boxes
            path_to_bboxes = os.path.join(root, 'bounding_boxes.txt')
            bounding_boxes = list()
            with open(path_to_bboxes, 'r') as in_file:
                for line in in_file:
                    idx, x, y, w, h = map(lambda x: float(x), line.strip('\n').split(' '))
                    if int(idx) in indices_to_use:
                        bounding_boxes.append((x, y, w, h))

            self.bboxes = bounding_boxes
        else:
            self.bboxes = None

    def __getitem__(self, index):
        # generate one sample
        sample, target = super(DatasetBirds, self).__getitem__(index)

        if self.bboxes is not None:
            # squeeze coordinates of the bounding box into the range [0, 1],
            # following the pad-to-500 / crop-to-375 transformation pipeline
            width, height = sample.width, sample.height
            x, y, w, h = self.bboxes[index]

            scale_resize = 500 / width
            scale_resize_crop = scale_resize * (375 / 500)

            x_rel = scale_resize_crop * x / 375
            y_rel = scale_resize_crop * y / 375
            w_rel = scale_resize_crop * w / 375
            h_rel = scale_resize_crop * h / 375

            target = torch.tensor([target, x_rel, y_rel, w_rel, h_rel])

        if self.transform_ is not None:
            sample = self.transform_(sample)
        if self.target_transform_ is not None:
            target = self.target_transform_(target)

        return sample, target

All pre-trained models expect input images normalized in the same way, such that the height and width are at least 224 pixels. As you might have noticed from our previous analysis, the size of the data varies considerably, many images have a landscape layout rather than a portrait one, and the width is commonly close to the maximum value along both dimensions.

In order to improve the model's ability to learn bird representations, we'll use data augmentation. We want to transform images in such a way that the aspect ratio is maintained. One solution is to pad images so that both dimensions are equal to the larger side, i.e. a maximum-padding strategy. For that, we'll create a pad function to pad images to 500 pixels:

def pad(img, size_max=500):
    """
    Pads images to the specified size (height x width).
    """
    pad_height = max(0, size_max - img.height)
    pad_width = max(0, size_max - img.width)

    pad_top = pad_height // 2
    pad_bottom = pad_height - pad_top
    pad_left = pad_width // 2
    pad_right = pad_width - pad_left

    return TF.pad(
        img,
        (pad_left, pad_top, pad_right, pad_bottom),
        fill=tuple(map(lambda x: int(round(x * 256)), (0.485, 0.456, 0.406))))

Assuming birds can appear in any part of an image, we make the model able to capture them everywhere by randomly cropping and flipping images along both axes during training. The images of the test split, in turn, will be center-cropped before being fed into ResNet-50, as, based on the previous data exploration, we expect the majority of birds to be located in that part of the image.

For that, we are going to crop images to 375 x 375 pixels, as that matches the typical size of the majority of images. We'll also normalize images by the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225] to bring the distribution of pixel values closer to a Gaussian one.

# transform images
transforms_train = tv.transforms.Compose([
    tv.transforms.Lambda(pad),
    tv.transforms.RandomOrder([
        tv.transforms.RandomCrop((375, 375)),
        tv.transforms.RandomHorizontalFlip(),
        tv.transforms.RandomVerticalFlip()
    ]),
    tv.transforms.ToTensor(),
    tv.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

transforms_eval = tv.transforms.Compose([
    tv.transforms.Lambda(pad),
    tv.transforms.CenterCrop((375, 375)),
    tv.transforms.ToTensor(),
    tv.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

Then, we’ll organize images of the CUB-200-2011 dataset into three subsets to insure the proper model training and evaluation. As authors of the dataset suggest the way to assemble the training and test subsets, we split our data accordingly. Additionally, the validation split will be defined to further fine-tune the parameters of the model during the model evaluation process. For that, the training subset will be split using stratified sampling technique that ensures that each subset have equally balanced classes of different species.

# instantiate dataset objects according to the pre-defined splits
ds_train = DatasetBirds(IN_DIR_DATA, transform=transforms_train, train=True)
ds_val = DatasetBirds(IN_DIR_DATA, transform=transforms_eval, train=True)
ds_test = DatasetBirds(IN_DIR_DATA, transform=transforms_eval, train=False)
splits = skms.StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=RANDOM_SEED)
idx_train, idx_val = next(splits.split(np.zeros(len(ds_train)), ds_train.targets))

We’ll set up parameters for data loading and model training. To leverage computations and be able to proceed large dataset in parallel, we will collate input samples in several mini-batches and also denote how many sub-processes to use to generate them in order to leverage the training process.

# set hyper-parameters
params = {'batch_size': 24, 'num_workers': 8}
num_epochs = 100
num_classes = 200

After we’ll create a DataLoader object to yield samples of an each data split:

# instantiate data loaders
train_loader = td.DataLoader(
    dataset=ds_train,
    sampler=td.SubsetRandomSampler(idx_train),
    **params
)
val_loader = td.DataLoader(
    dataset=ds_val,
    sampler=td.SubsetRandomSampler(idx_val),
    **params
)
test_loader = td.DataLoader(dataset=ds_test, **params)

Building a baseline ResNet‑50 classifier

We are going to use the ResNet-50 model for the classification of bird species. ResNet (Residual Network) is a variant of convolutional neural network that was proposed as a solution to the vanishing gradient problem of large networks.

PyTorch provides the ResNet-50 model in torchvision.models, so we will instantiate the respective class and set the argument num_classes to 200, given that the dataset contains that number of bird species:

# instantiate the model
model = tv.models.resnet50(num_classes=num_classes).to(DEVICE)

More specifically, the chosen architecture is 50 layers deep and composed of 5 stages: 4 of them contain residual blocks, and 1 comprises convolution, batch normalization and ReLU operations.
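To see those stages for yourself, one quick (optional) way is to list the top-level children of the instantiated model:

# print the top-level stages of ResNet-50: the conv1/bn1/relu/maxpool stem,
# four stages of residual blocks (layer1..layer4), and the avgpool + fc head
for name, module in model.named_children():
    print(name, '->', module.__class__.__name__)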

Training and evaluating the model

The next step is to set the learning rate of our model as well as a schedule to adjust it during training for the sake of better performance. Training of the ResNet-50 model will be done using the Adam optimizer with an initial learning rate of 1e-3 and an exponentially decreasing learning rate schedule, so that the rate drops by a factor of gamma at each epoch.

# instantiate optimizer and scheduler
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
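As a quick sanity check of the schedule: with gamma=0.95, the learning rate at epoch t equals 1e-3 · 0.95^t, so it decays smoothly over the 100 epochs.

# lr_t = 1e-3 * 0.95**t, e.g. ~6.0e-4 at t=10, ~7.7e-5 at t=50, ~5.9e-6 at t=100
print([1e-3 * 0.95 ** t for t in (0, 10, 50, 100)])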

Finally, we are ready to train and validate our model to recognize and learn the differences between bird species. The cross-entropy loss and accuracy metric will be accumulated per epoch in order to inspect the model performance dynamics. After all training experiments, we test the model on the subset of previously unseen data to assess its overall goodness at bird classification using the accuracy metric.

# loop over epochs
for epoch in range(num_epochs):
    # train the model
    model.train()
    train_loss = list()
    train_acc = list()
    for batch in train_loader:
        x, y = batch

        x = x.to(DEVICE)
        y = y.to(DEVICE)

        optimizer.zero_grad()

        # predict bird species
        y_pred = model(x)

        # calculate the loss
        loss = F.cross_entropy(y_pred, y)

        # backprop & update weights
        loss.backward()
        optimizer.step()

        # calculate the accuracy
        acc = skmt.accuracy_score([val.item() for val in y], [val.item() for val in y_pred.argmax(dim=-1)])

        train_loss.append(loss.item())
        train_acc.append(acc)

    # validate the model
    model.eval()
    val_loss = list()
    val_acc = list()
    with torch.no_grad():
        for batch in val_loader:
            x, y = batch

            x = x.to(DEVICE)
            y = y.to(DEVICE)

            # predict bird species
            y_pred = model(x)

            # calculate the loss
            loss = F.cross_entropy(y_pred, y)

            # calculate the accuracy
            acc = skmt.accuracy_score([val.item() for val in y], [val.item() for val in y_pred.argmax(dim=-1)])

            val_loss.append(loss.item())
            val_acc.append(acc)

    # adjust the learning rate
    scheduler.step()

# test the model
true = list()
pred = list()
with torch.no_grad():
    for batch in test_loader:
        x, y = batch

        x = x.to(DEVICE)
        y = y.to(DEVICE)

        y_pred = model(x)

        true.extend([val.item() for val in y])
        pred.extend([val.item() for val in y_pred.argmax(dim=-1)])

# calculate the accuracy
test_accuracy = skmt.accuracy_score(true, pred)
print('Test accuracy: {:.3f}'.format(test_accuracy))

Figure 5 depicts the model performance metrics for ResNet-50:

Figure 5. Cross-entropy loss and accuracy metric against the number of epochs of the baseline ResNet‑50

As we see, the baseline model performs quite poorly, as it overfits. One of the main reasons is the lack of diverse training samples. Just a quick note: the CUB-200-2011 dataset has ~30 training images per species. It seems like we are stuck, doesn't it? Actually, there are some ways to overcome these issues.

Advancing the deep learning model

Well, we ran into a number of challenges in our previous analysis, so we may start thinking about how we can address these follow-up questions:

  • Question 1: How to deal with overfitting given the limited amount of training samples?
  • Question 2: How to improve the model performance in bird species recognition?

Let’s figure out how we can advance our baseline model in more detail.

How to deal with overfitting given the limited amount of training samples?

As was said before, deep neural networks require a lot of training samples. Practitioners have noticed that, in order to train a deep neural network from scratch, the amount of data should grow exponentially with the number of trainable parameters. Luckily, the generalization ability of a model trained on a larger dataset can be transferred to another, usually simpler, task.

In order to improve the performance of the baseline model for bird classification, we will initialize weights from a general-purpose model pre-trained on the ImageNet dataset and further fine-tune its parameters using CUB-200-2011. The training process remains the same, while the focus shifts to the fine-tuning of hyper-parameters.

PyTorch provides pre-trained models whose weights are downloaded through torch.utils.model_zoo. A pre-trained ResNet-50 can be constructed by passing pretrained=True into the constructor. This simple trick provides us with a model that already has well-initialized filters, so there is no need to learn them from scratch; we only need to swap the final layer to match our 200 classes.

# instantiate the model with ImageNet weights and replace the final
# fully-connected layer to output 200 bird classes (passing num_classes
# together with pretrained=True would fail, as the pre-trained weights
# assume a 1000-way output)
model = tv.models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
model = model.to(DEVICE)

We will also set a lower learning rate of 1e-4 in the optimizer, as we are starting from a network already pre-trained on a large-scale image classification task. And here are the results:

Figure 6. Cross-entropy loss and accuracy metric against the number of epochs of the pre-trained ResNet‑50

As we see, using the pre-trained model solves the overfitting problem, giving 80.77% test accuracy. Let's continue experimenting on that!

How to improve the model performance in bird species recognition?

Solution 1: Multi-task learning

Now we can extend this approach even further. Why should we increase the complexity of a single task if we can simply add another one? No reason at all. It has been noticed that introducing an additional, auxiliary, task improves a network's performance by forcing it to learn a more general representation of the training data.

As the Caltech-UCSD Birds-200-2011 dataset includes bounding boxes in addition to class labels, we will use these as an auxiliary target to train the network in a multi-task fashion. We will now predict the 4 coordinates of a bird's bounding box in addition to its species, so the model's head outputs 204 values:

# instantiate the pre-trained model and widen the head to 204 outputs
# (200 class logits + 4 bounding-box coordinates)
model = tv.models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, num_classes + 4)
model = model.to(DEVICE)

Now we need to slightly modify our training and validation blocks, as we want to make predictions and calculate the loss for two targets corresponding to the correct bird species and its bounding box coordinates (remember to instantiate the datasets with bboxes=True, so that targets include the box coordinates). Here's an example:

...
# make predictions
y_pred = model(x)

# predict bird species
y_pred_cls = y_pred[..., :-4]
y_cls = y[..., 0].long()

# predict bounding box coordinates
y_pred_bbox = y_pred[..., -4:]
y_bbox = y[..., 1:]

# calculate the composite loss
loss_cls = F.cross_entropy(y_pred_cls, y_cls)
loss_bbox = F.mse_loss(torch.sigmoid(y_pred_bbox), y_bbox)
loss = loss_cls + loss_bbox
...
Figure 7. Cross-entropy loss and accuracy metric against the number of epochs of the pre-trained ResNet‑50 enhanced with the auxiliary task

The results are even better: integrating the auxiliary task provides a stable increase in accuracy, giving 81.2% on the test split, as shown in Figure 7.

Solution 2: Attention-enhanced CNNs

In the last few paragraphs we focused on the data-driven advancement of our model. However, at some point the complexity of the task can exceed the model's capacity, resulting in lower performance. In order to match the model's power to the difficulty of the problem, we can equip the network with additional attention blocks that will help it focus on important parts of the input and ignore irrelevant ones.

class Attention(torch.nn.Module):
    """
    Attention block for CNN model.
    """
    def __init__(self, in_channels, out_channels, kernel_size, padding):
        super(Attention, self).__init__()
        self.conv_depth = torch.nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding, groups=in_channels)
        self.conv_point = torch.nn.Conv2d(out_channels, out_channels, kernel_size=(1, 1))
        self.bn = torch.nn.BatchNorm2d(out_channels, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True)
        self.activation = torch.nn.Tanh()

    def forward(self, inputs):
        x, output_size = inputs
        x = F.adaptive_max_pool2d(x, output_size=output_size)
        x = self.conv_depth(x)
        x = self.conv_point(x)
        x = self.bn(x)
        x = self.activation(x) + 1.0
        return x

The attention module highlights relevant regions of the feature maps and returns values in the range [0.0, 2.0], where a lower value implies a lower priority of a given pixel for the following layers.
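A quick sanity check of the block (shapes are illustrative: 94 x 94 is roughly what the 375 x 375 crops shrink to after the ResNet stem) confirms the output range:

# the mask values lie in [0.0, 2.0] thanks to tanh(x) + 1.0
att = Attention(in_channels=64, out_channels=256, kernel_size=(3, 5), padding=(1, 2))
x_a = torch.randn(1, 64, 94, 94)   # features entering a ResNet stage
mask = att((x_a, (94, 94)))        # pooled to the stage's output size
print(mask.shape, float(mask.min()), float(mask.max()))

Next, we'll create and instantiate the class ResNet50Attention corresponding to the attention-enhanced ResNet-50 model: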

class ResNet50Attention(torch.nn.Module):
    """
    Attention-enhanced ResNet-50 model.
    """
    weights_loader = staticmethod(tv.models.resnet50)

    def __init__(self, num_classes=200, pretrained=True, use_attention=True):
        super(ResNet50Attention, self).__init__()
        net = self.weights_loader(pretrained=pretrained)

        self.num_classes = num_classes
        self.pretrained = pretrained
        self.use_attention = use_attention

        net.fc = torch.nn.Linear(
            in_features=net.fc.in_features,
            out_features=num_classes,
            bias=net.fc.bias is not None
        )
        self.net = net

        if self.use_attention:
            # channel sizes match the outputs of the four ResNet-50 stages
            # (256, 512, 1024, 2048), so the masks broadcast over the features
            self.att1 = Attention(in_channels=64, out_channels=256, kernel_size=(3, 5), padding=(1, 2))
            self.att2 = Attention(in_channels=256, out_channels=512, kernel_size=(5, 3), padding=(2, 1))
            self.att3 = Attention(in_channels=512, out_channels=1024, kernel_size=(3, 5), padding=(1, 2))
            self.att4 = Attention(in_channels=1024, out_channels=2048, kernel_size=(5, 3), padding=(2, 1))

            if pretrained:
                # zero-initialized batch norms make the attention masks start
                # as all-ones, preserving the pre-trained behaviour at first
                self.att1.bn.weight.data.zero_()
                self.att1.bn.bias.data.zero_()
                self.att2.bn.weight.data.zero_()
                self.att2.bn.bias.data.zero_()
                self.att3.bn.weight.data.zero_()
                self.att3.bn.bias.data.zero_()
                self.att4.bn.weight.data.zero_()
                self.att4.bn.bias.data.zero_()

    def _forward(self, x):
        return self.net(x)

    def _forward_att(self, x):
        x = self.net.conv1(x)
        x = self.net.bn1(x)
        x = self.net.relu(x)
        x = self.net.maxpool(x)

        x_a = x.clone()
        x = self.net.layer1(x)
        x = x * self.att1((x_a, x.shape[-2:]))

        x_a = x.clone()
        x = self.net.layer2(x)
        x = x * self.att2((x_a, x.shape[-2:]))

        x_a = x.clone()
        x = self.net.layer3(x)
        x = x * self.att3((x_a, x.shape[-2:]))

        x_a = x.clone()
        x = self.net.layer4(x)
        x = x * self.att4((x_a, x.shape[-2:]))

        x = self.net.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.net.fc(x)

        return x

    def forward(self, x):
        return self._forward_att(x) if self.use_attention else self._forward(x)

# instantiate the model
model = ResNet50Attention(num_classes=204, pretrained=True, use_attention=True).to(DEVICE)

After that, we are ready to train and evaluate the performance of the attention-enhanced model, pre-trained on the ImageNet dataset and advanced with multi-task learning, for bird classification using the same code as before. The final accuracy score increases to 82.4%!

Figure 8 shows summary results generated during the analysis:

Figure 8. Comparison of the performance of ResNet‑50 advanced using different techniques

The results clearly indicate that the final variant of ResNet-50, advanced with transfer and multi-task learning as well as the attention module, produces considerably more accurate bird predictions.

Conclusions

Here, we used different approaches to improve the performance of a baseline ResNet-50 model for the classification of bird species from the CUB-200-2011 dataset. What can we learn from that? Here are some take-home messages from our analysis:

  • Data exploration indicates that CUB-200-2011 is a high-quality, balanced, although center-biased, dataset without corrupted images.
  • Given a limited amount of training samples, you can reuse the weights of a model pre-trained on another dataset in your own model.
  • Learning an auxiliary task in addition to the primary bird classification one contributes to better model performance.
  • Enhancing the network's architecture with new layers (the attention module) makes the model more accurate in bird species classification.
  • Analysis of different extensions of the basic ResNet-50 points to the pre-trained model advanced with the auxiliary task and attention mechanism as the most promising candidate for further investigation.

In summary, there is still room for improving the model performance. Additional advancements can be achieved by further optimizing the model hyper-parameters and by using stronger data augmentation, regularization, and meta-learning techniques.

More coming soon!

The focus of the next tutorials will be on finite automata simulation and the interpretability of deep learning models for leveraging AI-assisted systems. Interested in keeping up? Subscribe and stay updated on more deep learning materials at https://medium.com/@slipnitskaya.

References

  1. LeCun, Yann, et al. “Backpropagation applied to handwritten zip code recognition.” Neural computation 1.4 (1989): 541–551.
  2. LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278–2324.
  3. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Communications of the ACM 60.6 (2017): 84–90.
  4. He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition (2016): 770–778.
  5. Wah, Catherine, et al. "The Caltech-UCSD Birds-200-2011 dataset." Computation & Neural Systems Technical Report CNS-TR-2011-001 (2011).
