Adventures in PyTorch

Image Classification of bird species using deep learning with PyTorch, Captum and ONNX

Part 0: Introduction to image classification, deep learning and the CalTech Birds 200 dataset

Ed Morris
11 min read · Jul 31, 2020

This series explores PyTorch, the neural network and machine learning framework from Facebook AI Research (FAIR). We will apply PyTorch to an image classification problem, identifying 200 species of North American birds using the CalTech 200 birds dataset, with various CNN architectures including GoogLeNet, ResNet152 and ResNeXt101, among others.

CalTech UCSD Birds 200 Dataset (Welinder P., Branson S., Mita T., Wah C., Schroff F., Belongie S., Perona, P. “Caltech-UCSD Birds 200”. California Institute of Technology. CNS-TR-2010-001. 2010)

Introduction

In this set of articles, I will explore how we can use Facebook AI Research’s neural network library PyTorch to solve an image classification problem with many classes. This extensive library comes with a broad array of tools packaged with the distribution, ranging from high-level abstractions in the torch.nn module (the PyTorch equivalent of Keras) to low-level autograd functions and efficient GPU-based operations, and allows for the design and fast implementation of state-of-the-art machine learning architectures. At its most basic level, PyTorch can be considered a highly optimized multi-dimensional array library, similar to the well-known Python package NumPy, with added GPU support and automatic differentiation.
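To make the NumPy comparison concrete, here is a minimal sketch of my own (it is not taken from the series notebooks). It assumes both numpy and torch are installed, and the array values are arbitrary illustrations.

```python
import numpy as np
import torch

# A NumPy array and a PyTorch tensor built from the same data.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
t = torch.from_numpy(a)  # shares memory with `a`, dtype float64

# Matrix and element-wise operations read almost identically in both libraries.
np_result = a @ a + 1.0
torch_result = t @ t + 1.0

# What PyTorch adds on top: automatic differentiation (autograd).
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x1^2 + x2^2 + x3^2
y.backward()         # fills x.grad with dy/dx = 2x
grad = x.grad

# Tensors can also be moved to a GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_on_device = t.to(device)
```

The same arithmetic gives the same numbers in both libraries; the gradient tracking and the device placement are what make the tensor version suitable for training neural networks.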

Figure 1: CalTech Birds classification workshop workflow. Each article in this series will focus on a separate stage of this workflow. The notebooks used to generate these articles, along with the trained models, can be found in this GitHub repository. (Image by author)

The workflow that will be undertaken is illustrated in Figure 1, and each article will focus on a particular aspect of this workflow, including:

  1. Data exploration and introduction to classification problems.
  2. Preparing an image classification Convolutional Neural Network (CNN) and training it using the following architectures:
    A) Torchvision pre-trained networks.
    B) 3rd party pre-trained networks.
  3. Assessment of classification model performance using traditional methods (classification reports, metrics, confusion matrices, precision-recall curves, etc.).
  4. Assessment of network classification performance using dimensionality reduction and manifold learning techniques (e.g. t-SNE and UMAP) applied to neural network activation maps.
  5. Transfer learning, using a trained neural network as a feature extractor as input into other classification algorithms (XGBoost, Kernel SVM).
  6. Feature visualisations of neural network convolutional filters (neuron, spatial and layer activations), using the Lucent python package (Lucid for PyTorch).
  7. Deployment of a trained model through the conversion of the PyTorch model object to ONNX (Open Neural Network eXchange) format, and demonstration of inference using the ONNX Runtime environment.

In this introduction to the series, we will review what classification is from a machine learning perspective, what we mean by image classification, and in particular what image classification using deep learning techniques involves. Following that, we will introduce the fine-grained image classification problem of bird species classification, and review the chosen dataset, which we will use later to demonstrate these techniques. Finally, we will look at a summary of the results obtained with state-of-the-art convolutional neural networks developed over the past few years; the following articles will show how to achieve these results.

What is a predictive model and what is classification?

Figure 2: Machine learning algorithms fit into two general fields, classification and regression. (Image by author)

Predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the answer. Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation.

The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.

Fundamentally, the difference between classification and regression is that one aims to predict a label or category, whilst the other aims to predict a quantity.

Figure 2 illustrates the difference between these two approaches. A classification model, for example a Support Vector Machine (SVM) classifier, predicts the class value for a given data point on the basis of its location with respect to the class boundary. A regression model, however, is designed to predict the actual quantity of something, given a set of predictors (columns of data).

Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation. For example, an email can be classified as belonging to one of two classes: “spam” and “not spam”.

  • A classification problem requires that examples be classified into one of two or more classes.
  • A classification problem can have real-valued or discrete input variables.
  • A problem with two classes is often called a two-class or binary classification problem.
  • A problem with more than two classes is often called a multi-class classification problem.
  • A problem where an example is assigned multiple classes is called a multi-label classification problem.

It is common for classification models to predict a continuous value as the probability of a given example belonging to each output class. The probabilities can be interpreted as the likelihood or confidence of a given example belonging to each class. A predicted probability can be converted into a class value by selecting the class label that has the highest probability.
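As a minimal sketch of this last step, the snippet below converts raw model scores into probabilities with a softmax and then picks the most probable class. The class names and scores are hypothetical values chosen for illustration, and softmax is one common way (not the only way) of producing such probabilities.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class names and raw model outputs for a single example.
class_names = ["warbler", "sparrow", "finch"]
raw_scores = [2.0, 0.5, -1.0]

probabilities = softmax(raw_scores)

# The predicted class is the label with the highest probability.
best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
predicted_label = class_names[best_index]

for name, prob in zip(class_names, probabilities):
    print(f"{name}: {prob:.3f}")
print("Predicted class:", predicted_label)
```

Keeping the full probability list, rather than only the winning label, is what later makes it possible to draw precision-recall curves or inspect the model's confidence.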

In these articles we will focus on classification, and in particular on a subset of the problem called image classification, which we will describe in more detail in the next section.

What is image classification?

Figure 3: image classification of objects, where the image is classified on the basis of the contents of the image. (Image by author)

Image classification is a sub-domain of the general problem of classification. An algorithm designed for image classification accepts images as its input, and produces a prediction of the class of the image as output. The output can take the form of a label or category, or a set of real-valued probabilities representing the likelihood of the image belonging to each potential class. The cartoon in Figure 3 shows an algorithm that has been designed to identify whether an image of food is an ice cream, an ice lolly or a cake, taking the image of the food as input, and delivering a class label as a prediction.

In order to successfully classify an image, an algorithm needs to extract representative “features” from the input images, and learn the respective patterns that allow it to differentiate between classes based on the contents of the image (Figure 4A). These “features” can be manually created through an iterative and interactive design process, with the aim of producing features that best separate the classes. However, this manual approach has the following potential drawbacks:

  1. Human intensive.
  2. Prone to error.
  3. Exhaustive process.
  4. No guarantee of unique features which optimally distinguish between classes.

Figure 4A: differing approaches to image classification using (top) machine learning with manually crafted features and (bottom) deep learning using CNNs to automatically derive both idealized features and a classifier in one process. (Image by author)

Another approach, which has been developed extensively over the past 10 years and has brought considerable improvement in image classification performance, is the use of deep learning, and in particular Convolutional Neural Networks (CNNs). These approaches attempt to solve two problems at the same time: designing an optimal set of extracted features, and building a classifier that predicts the object class. In its simplest form, a CNN can be thought of as a huge, automatically derived (or learnt) feature extraction system, with a classifier tacked on the end.

Figure 4B: The concept of training from scratch (top) versus Transfer Learning (bottom). In Transfer Learning, the pre-trained network is used as a feature extraction network, by removing the final layer classifier and using the extracted feature outputs as input into a new classifier (which does not necessarily have to be neural network based). (Image by author)

In fact, it is this very structure that allows for the methodology of so-called “Transfer Learning” (Figure 4B). This is the process of removing the classification layer at the end of the network, and using the remaining network as a feature extractor. The outputs of this feature extraction network can then be fed into another classifier, which can be any machine learning classifier (e.g. SVM, Kernel SVM, decision tree based ensembles like XGBoost), to perform class predictions. In this case, the inputs to the classifier are not the images themselves but the set of features (sometimes termed activations) extracted from each image, from which the classifier learns to predict the image class.
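The idea can be sketched in a few lines of PyTorch. The TinyCNN class below is a hypothetical, untrained stand-in that I use only to keep the example self-contained; in practice a pre-trained network would take its place, and the layer names (features, classifier) are assumptions of this sketch rather than those of any particular library model.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TinyCNN(nn.Module):
    """Hypothetical stand-in for a pre-trained CNN: feature extractor + classifier."""

    def __init__(self, num_classes=200):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one value per channel
            nn.Flatten(),             # shape: (batch, 8)
        )
        self.classifier = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyCNN()

# Remove the final classification layer by replacing it with a pass-through.
model.classifier = nn.Identity()
model.eval()

# A batch of 4 random "images" (3 channels, 64 x 64 pixels) as placeholder input.
images = torch.rand(4, 3, 64, 64)

# No gradients are needed when the network is only used to extract features.
with torch.no_grad():
    features = model(images)

print(features.shape)  # one 8-dimensional feature vector per image
```

The resulting feature matrix, one row per image, is what would be handed to a separate classifier such as an SVM or XGBoost model.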

It is these approaches that we will use to build a state-of-the-art image classifier with PyTorch in the following article series.

The Dataset — The CalTech UCSD 200 Birds Database (CUB-200–2011)

Figure 5: Example birds of similar species (warblers) from CUB-200–2011, showing the complexity and difficulty of bird species identification. Even with the species diagrams shown in Figure 6, it is difficult for humans to identify the specific bird species.

The CUB-200–2011 dataset contains images of North American birds from 200 different species. It is a challenging problem as many of the species have a high degree of visual similarity. Bird species identification can be challenging for humans, let alone computer vision algorithms, hence this type of problem is often referred to as Large Scale Fine-Grained classification. For example, identifying the correct species of warbler for the two birds in Figure 5 remains a very hard problem for humans, even with the species drawings in Figure 6 to hand.

Figure 6: Example drawings of North American warbler species (by Kate Dolamore Art).

The dataset was originally produced in 2010 (CUB-200), and contained ~6000 images of the 200 classes of birds. Accompanying this was additional label data including bounding boxes, rough segmentations and additional attributes. This was updated in 2011 (CUB-200–2011) to add additional images, bringing the total number of images in the dataset to almost 12,000. The available attributes were also updated to include 15 part locations, 312 binary attributes and a bounding box per image (Figure 7). For the majority of this series, we will simply be using the images and class labels to develop and train networks for predicting the bird class.

Selected Published Results on CUB-200–2011

Wah et al (2010) reported that, using RGB color histograms and histograms of vector-quantized SIFT descriptors with a linear SVM, they obtained a classification accuracy of 17.3%. This was produced as a basis for comparison with more advanced techniques, including deep learning approaches, and is an example of using hand-crafted features with machine learning to solve the image classification problem (as shown in the top diagram of Figure 4A).

Using the full-scale, uncropped images, Wah et al achieved an overall mean classification accuracy of 10.3%. In 2014, Goering et al presented an approach for fine-grained recognition based on a non-parametric label transfer technique, which transfers part constellations from objects with similar global shapes, achieving a 57.8% mean class accuracy.

Figure 8: Top performing models on CUB-200–2011 dataset, from paperswithcode.com

More recent studies into fine-grained image classification have been largely based on convolutional neural network approaches. Considerably better accuracy has been achieved by these approaches, with mean class accuracy approaching 90% (Figure 8; graph of the best performing algorithms on CUB-200–2011 from paperswithcode.com). For example, Cui et al (2017) demonstrate the use of modern neural network architectures to obtain significantly more accurate predictions. A key part of the success of their approach was the use of deep learning networks, combined with training on high resolution images and handling the long-tailed class distributions typical of fine-grained problems. They also show the advantages of using domain specific pre-training datasets to increase the accuracy on smaller, fine-grained classification problems such as bird species identification.

What is possible with modern tools?

With minimal effort, what really is possible, I hear you say?

In this set of articles, the aim is to show how relatively simple it is to access state-of-the-art image classification models, and obtain performance close to the current leading networks. As this discussion has shown, this kind of performance on fine-grained image classification problems has only been possible for the past few years.

Before diving into the detail of exactly how to set up, train and deploy a PyTorch CNN for image classification, let’s look at the results that we can obtain using a relatively straightforward approach. In the coming articles, I will share more details of the underlying Python code that has been developed to train these models, but here I want to share the final results of all the different network architectures that I trained on the dataset.

Figure 9: PyTorch CNN image classification architectures performance comparison using class macro average metrics. Evaluated on a held out test set of the CUB-200–2011 dataset, after pre-training on ImageNet, and further training using CUB-200–2011. (Image by author)

Figure 9 shows the performance of a number of different model architectures, all Convolutional Neural Networks (CNNs) for image classification, trained on the CUB-200–2011 dataset. These models range from one of the first successful model architectures to exploit deeper networks, GoogLeNet, proposed by Szegedy et al (2014) from Google in the paper “Going Deeper with Convolutions”, to more recent network architectures including ResNeXt (after Xie et al, 2017) and PNASNet (after Liu et al, 2018), both of which achieved close to 85% average class accuracy in bird species identification.

This example shows that, even in the space of 3 to 4 years of development, the improvement in performance of these networks is considerable. It is worth explicitly noting here that the only difference between these models is the network architecture itself. The selection of the training and test images, along with the image augmentation process and the assessment of the models’ performance, is identical for all models concerned. It is simply the differing network architectures that have allowed the latest networks to achieve significant improvements in classification performance.
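For readers unfamiliar with class-averaged metrics, here is a minimal sketch of how an average class accuracy differs from plain overall accuracy. I am assuming here that "average class accuracy" means the unweighted mean of the per-class accuracies; the labels and predictions are made-up illustrative values, not results from the trained models.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted examples for each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_label, predicted_label in zip(y_true, y_pred):
        total[true_label] += 1
        if true_label == predicted_label:
            correct[true_label] += 1
    return {label: correct[label] / total[label] for label in total}

# Made-up labels: four warblers and two sparrows.
y_true = ["warbler", "warbler", "warbler", "warbler", "sparrow", "sparrow"]
y_pred = ["warbler", "warbler", "warbler", "sparrow", "sparrow", "warbler"]

class_acc = per_class_accuracy(y_true, y_pred)

# Macro average: every class counts equally, however many images it has.
macro_acc = sum(class_acc.values()) / len(class_acc)

# Overall accuracy: every image counts equally.
overall_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(class_acc)    # warbler: 3 of 4 correct, sparrow: 1 of 2 correct
print(macro_acc)    # mean of 0.75 and 0.5
print(overall_acc)  # 4 of 6 correct
```

With imbalanced classes the two numbers diverge, which is why class-averaged metrics are the fairer choice when some species have fewer images than others.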

Join me over the set of forthcoming articles, where we will discover how to use Python and PyTorch to build a state-of-the-art bird species classifier that produces the results illustrated above, as well as methods to understand how it is performing and how it makes decisions.

The code and notebooks that were used to produce this set of articles have been published on GitHub and can be found here. The data is available through links on the GitHub page.

References:

  1. Szegedy, C. et al. Going Deeper with Convolutions. arXiv:1409.4842 [cs] (2014).
  2. Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated Residual Transformations for Deep Neural Networks. arXiv:1611.05431 [cs] (2017).
  3. Liu, C. et al. Progressive Neural Architecture Search. arXiv:1712.00559 [cs] (2018).
  4. Goering, C., Rodner, E., Freytag, A. & Denzler, J. Nonparametric Part Transfer for Fine-Grained Recognition. in 2014 IEEE Conference on Computer Vision and Pattern Recognition 2489–2496 (IEEE, 2014). doi:10.1109/CVPR.2014.319.

Background reading:

For background reference material on neural networks, including in theory and in practice, the interested reader is referred to the following excellent resources.

Practical guides for deep learning and neural networks:

  1. Rosebrock, A. Deep Learning For Computer Vision With Python. vol. 1–3 (2017).
  2. Chollet, F. Deep learning with Python. (Manning Publications Co, 2018).

Background theory of neural networks and deep learning:

  1. Aggarwal, C. C. Neural networks and deep learning: a textbook. (Springer, 2018).
  2. Bishop, C. M. Pattern recognition and machine learning. (Springer, 2006).
  3. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning. (MIT Press, 2016).
  4. Hagan, M. T., Demuth, H. B., Beale, M. H. & De Jesus, O. Neural Network Design (2nd Edition). (Self published, 2014).

General machine learning background theory and practice:

  1. Aggarwal, C. C. Data Classification: Algorithms and Applications. (Chapman & Hall/CRC, 2015).
  2. Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning. (Springer, 2017).
  3. Kuhn, M. & Johnson, K. Applied Predictive Modeling. (Springer New York, 2013). doi:10.1007/978-1-4614-6849-3.
  4. Efron, B. & Hastie, T. Computer Age Statistical Inference: Algorithms, Evidence and Data Science. (Cambridge University Press, 2017).
  5. James, G., Witten, D., Hastie, T. & Tibshirani, R. An introduction to statistical learning: with applications in R. (Springer, 2013).

I am a data scientist with 15 years of experience working with machine learning and optimisation theory on large datasets in multiple industry sectors.